This patch adds a bpf self-test to cover BPF_LWT_ENCAP_IP mode
in bpf_lwt_push_encap.
Covered:
- encapping in LWT_IN and LWT_XMIT
- IPv4 and IPv6
A follow-up patch will add GSO and VRF-enabled tests.
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch copies changes in bpf.h done by a previous patch
in this patchset from the kernel uapi include dir into tools
uapi include dir.
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch builds on top of the previous patch in the patchset,
which added BPF_LWT_ENCAP_IP mode to bpf_lwt_push_encap. As the
encapping can result in the skb needing to go via a different
interface/route/dst, bpf programs can indicate this by returning
BPF_LWT_REROUTE, which triggers a new route lookup for the skb.
v8 changes: fix kbuild errors when LWTUNNEL_BPF is builtin, but
IPV6 is a module: as LWTUNNEL_BPF can only be either Y or N,
call IPV6 routing functions only if they are built-in.
v9 changes:
- fixed a kbuild test robot compiler warning;
- call IPV6 routing functions via ipv6_stub.
v10 changes: removed unnecessary IS_ENABLED and pr_warn_once.
v11 changes: fixed a potential dst leak.
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Proxy ip6_route_input via ipv6_stub, for later use by lwt bpf ip encap
(see the next patch in the patchset).
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch adds handling of GSO packets in bpf_lwt_push_ip_encap()
(called from bpf_lwt_push_encap):
* IPIP, GRE, and UDP encapsulation types are deduced by looking
into iphdr->protocol or ipv6hdr->next_header;
* SCTP GSO packets are not supported (as bpf_skb_proto_4_to_6
and similar do);
* UDP_L4 GSO packets are also not supported (although they are
not blocked in bpf_skb_proto_4_to_6 and similar), as
skb_decrease_gso_size() will break it;
* SKB_GSO_DODGY bit is set.
Note: it may be possible to support SCTP and UDP_L4 gso packets;
but as these cases seem to be not well handled by other
tunneling/encapping code paths, the solution should
be generic enough to apply to all tunneling/encapping code.
v8 changes:
- make sure that if GRE or UDP encap is detected, there is
enough of pushed bytes to cover both IP[v6] + GRE|UDP headers;
- do not reject double-encapped packets;
- whitelist TCP GSO packets rather than block SCTP GSO and
UDP GSO.
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Implement BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap BPF helper.
It enables BPF programs (specifically, BPF_PROG_TYPE_LWT_IN and
BPF_PROG_TYPE_LWT_XMIT prog types) to add IP encapsulation headers
to packets (e.g. IP/GRE, GUE, IPIP).
This is useful when thousands of different short-lived flows should be
encapped, each with different and dynamically determined destination.
Although lwtunnels can be used in some of these scenarios, the ability
to dynamically generate encap headers adds more flexibility, e.g.
when routing depends on the state of the host (reflected in global bpf
maps).
v7 changes:
- added a call skb_clear_hash();
- removed calls to skb_set_transport_header();
- refuse to encap GSO-enabled packets.
v8 changes:
- fix build errors when LWT is not enabled.
Note: the next patch in the patchset with deal with GSO-enabled packets,
which are currently rejected at encapping attempt.
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch adds all needed plumbing in preparation to allowing
bpf programs to do IP encapping via bpf_lwt_push_encap. Actual
implementation is added in the next patch in the patchset.
Of note:
- bpf_lwt_push_encap can now be called from BPF_PROG_TYPE_LWT_XMIT
prog types in addition to BPF_PROG_TYPE_LWT_IN;
- if the skb being encapped has GSO set, encapsulation is limited
to IPIP/IP+GRE/IP+GUE (both IPv4 and IPv6);
- as route lookups are different for ingress vs egress, the single
external bpf_lwt_push_encap BPF helper is routed internally to
either bpf_lwt_in_push_encap or bpf_lwt_xmit_push_encap BPF_CALLs,
depending on prog type.
v8 changes: fixed a typo.
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In sctp_stream_init(), after sctp_stream_outq_migrate() freed the
surplus streams' ext, but sctp_stream_alloc_out() returns -ENOMEM,
stream->outcnt will not be set to 'outcnt'.
With the bigger value on stream->outcnt, when closing the assoc and
freeing its streams, the ext of those surplus streams will be freed
again since those stream exts were not set to NULL after freeing in
sctp_stream_outq_migrate(). Then the invalid-free issue reported by
syzbot would be triggered.
We fix it by simply setting them to NULL after freeing.
Fixes: 5bbbbe32a4 ("sctp: introduce stream scheduler foundations")
Reported-by: syzbot+58e480e7b28f2d890bfd@syzkaller.appspotmail.com
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jianlin reported a panic when running sctp gso over gre over vlan device:
[ 84.772930] RIP: 0010:do_csum+0x6d/0x170
[ 84.790605] Call Trace:
[ 84.791054] csum_partial+0xd/0x20
[ 84.791657] gre_gso_segment+0x2c3/0x390
[ 84.792364] inet_gso_segment+0x161/0x3e0
[ 84.793071] skb_mac_gso_segment+0xb8/0x120
[ 84.793846] __skb_gso_segment+0x7e/0x180
[ 84.794581] validate_xmit_skb+0x141/0x2e0
[ 84.795297] __dev_queue_xmit+0x258/0x8f0
[ 84.795949] ? eth_header+0x26/0xc0
[ 84.796581] ip_finish_output2+0x196/0x430
[ 84.797295] ? skb_gso_validate_network_len+0x11/0x80
[ 84.798183] ? ip_finish_output+0x169/0x270
[ 84.798875] ip_output+0x6c/0xe0
[ 84.799413] ? ip_append_data.part.50+0xc0/0xc0
[ 84.800145] iptunnel_xmit+0x144/0x1c0
[ 84.800814] ip_tunnel_xmit+0x62d/0x930 [ip_tunnel]
[ 84.801699] gre_tap_xmit+0xac/0xf0 [ip_gre]
[ 84.802395] dev_hard_start_xmit+0xa5/0x210
[ 84.803086] sch_direct_xmit+0x14f/0x340
[ 84.803733] __dev_queue_xmit+0x799/0x8f0
[ 84.804472] ip_finish_output2+0x2e0/0x430
[ 84.805255] ? skb_gso_validate_network_len+0x11/0x80
[ 84.806154] ip_output+0x6c/0xe0
[ 84.806721] ? ip_append_data.part.50+0xc0/0xc0
[ 84.807516] sctp_packet_transmit+0x716/0xa10 [sctp]
[ 84.808337] sctp_outq_flush+0xd7/0x880 [sctp]
It was caused by SKB_GSO_CB(skb)->csum_start not set in sctp_gso_segment.
sctp_gso_segment() calls skb_segment() with 'feature | NETIF_F_HW_CSUM',
which causes SKB_GSO_CB(skb)->csum_start not to be set in skb_segment().
For TCP/UDP, when feature supports HW_CSUM, CHECKSUM_PARTIAL will be set
and gso_reset_checksum will be called to set SKB_GSO_CB(skb)->csum_start.
So SCTP should do the same as TCP/UDP, to call gso_reset_checksum() when
computing checksum in sctp_gso_segment.
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Maxime Chevallier says:
====================
net: phy: Add 2.5G/5GBASET PHYs support
The 802.3bz standard defines 2 modes based on the NBASET alliance work
that allow to use 2.5Gbps and 5Gbps speeds on Cat 5e, 6 and 7 cables.
This series adds the necessary infrastructure to handle these modes with
C45 PHYs. This series was originally part of a bigger one, that has
seen 2 iterations [1] [2] that added support for these modes on Marvell
Alaska PHYs.
Following some discussions with Heiner and Andrew [3], we decided to
split-out the generic parts so that we can work together on the
following steps to get these mode fully working with Aquantia and
Marvell PHYS.
The first 3 patches are reworking some of the internal network phy
infrastructure to handle the new modes in a more generic way.
The 4th patch adds all the C45 register definition and accesses that
follows the 802.3bz standard to support 2.5GBASET and 5GBASET.
[1] : https://lore.kernel.org/netdev/20190118152352.26417-1-maxime.chevallier@bootlin.com/
[2] : https://lore.kernel.org/netdev/20190207094939.27369-1-maxime.chevallier@bootlin.com/
[3] : https://lore.kernel.org/netdev/81c340ea-54b0-1abf-94af-b8dc4ee83e3a@gmail.com/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The 802.3bz specification, based on previous by the NBASET alliance,
defines the 2.5GBaseT and 5GBaseT link modes for ethernet traffic on
cat5e, cat6 and cat7 cables.
These mode integrate with the already defined C45 MDIO PMA/PMD registers
set that added 10G support, by defining some previously reserved bits,
and adding a new register (2.5G/5G Extended abilities).
This commit adds the required definitions in include/uapi/linux/mdio.h
to support these modes, and detect when a link-partner advertises them.
It also adds support for these mode in the generic C45 PHY
infrastructure.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marvell 10G PHY driver has a generic way of initializing the supported
link modes by reading the PHY's C45 PMA abilities. This can be made
generic, since these registers are part of the 802.3 specifications.
This commit extracts the config_init link_mode initialization code from
marvell10g and uses it to introduce the genphy_c45_pma_read_abilities
function.
Only PMA modes are read, it's still up to the caller to set the Pause
parameters.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since of_set_phy_supported was moved to phy-core.c, we can also move
of_set_phy_eee_broken to the same location, so that we have all OF
functions in the same place.
This patch doesn't intend to introduce any change in behaviour.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
When setting a PHY's max speed using either the max-speed DT property
or ethtool, we should mask-out all non-compatible modes according to the
settings table, instead of just the 10/100BASET modes.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Suggested-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains Netfilter/IPVS fixes for net:
1) Missing structure initialization in ebtables causes splat with
32-bit user level on a 64-bit kernel, from Francesco Ruggeri.
2) Missing dependency on nf_defrag in IPVS IPv6 codebase, from
Andrea Claudi.
3) Fix possible use-after-free from release path of target extensions.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJcZKsDAAoJEEg/ir3gV/o+gewH/3yQYIgVIjibdw1YF5S4vGXV
M0FOsx2OEcaxVjsm+GzWTthARtIO/d5NuYO/j7QsWLopGugTGCSc6SusrVQtH7PY
b5Fz/pR+ElArt0Ovx06A1twdO5gtoT920rP+E+Gvl/ZR78ceknHUGxJBnGEh2p9D
svM3Q4mdmJZ+a3ehMCxMWS2vf7If43fhHXyt9OKEMpbzAJ03MVLVagt3OGUNi6QP
vl4Fuq5OX6zumKLXPpql8kh+Fbu9QHc8vcuwPnG8GmsBetS083HSC6EPRUbo/EPR
3IzuAibltXoYWlvCPpotCjjSuyqvkOdVudD5jyM3gFIS3XuJdnwkn6k5gvPCk6E=
=W+Ws
-----END PGP SIGNATURE-----
Merge tag 'mlx5-fixes-2019-02-13' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
Mellanox, mlx5 fixes 2019-02-13
This series introduces some fixes to mlx5 driver.
For more information please see tag log below.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently mlx5 driver creates xdp redirect hw queues unconditionally on
netdevice open, This is great until someone starts redirecting XDP traffic
via ndo_xdp_xmit on mlx5 device and changes the device configuration at
the same time, this might cause crashes, since the other device's napi
is not aware of the mlx5 state change (resources un-availability).
To fix this we must synchronize with other devices napi's on the system.
Added a new flag under mlx5e_priv to determine XDP TX resources are
available, set/clear it up when necessary and use synchronize_rcu()
when the flag is turned off, so other napi's are in-sync with it, before
we actually cleanup the hw resources.
The flag is tested prior to committing to transmit on mlx5e_xdp_xmit, and
it is sufficient to determine if it safe to transmit or not. The other
two internal flags (MLX5E_STATE_OPENED and MLX5E_SQ_STATE_ENABLED) become
unnecessary. Thus, they are removed from data path.
Fixes: 58b99ee3e3 ("net/mlx5e: Add support for XDP_REDIRECT in device-out side")
Reported-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Eliminate the following compilation warning:
drivers/net/ethernet/mellanox/mlx5/core/events.c: warning: 'error_str'
may be used uninitialized in this function [-Wuninitialized]: => 238:3
Fixes: c2fb3db22d ("net/mlx5: Rework handling of port module events")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Mikhael Goikhman <migo@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When EEH is injected and PCI bus stalls, mlx5's pci error detect
function is called to deactivate the command interface and tear down
the device. The issue is that there can be a thread that already
passed MLX5_DEVICE_STATE_INTERNAL_ERROR check, it will send the command
and stuck in the wait_func.
Solution:
Add function mlx5_cmd_flush to disable command interface and clear all
the pending commands. When device state is set to
MLX5_DEVICE_STATE_INTERNAL_ERROR, call mlx5_cmd_flush to ensure all
pending threads waiting for firmware commands completion are terminated.
Fixes: c1d4d2e92a ("net/mlx5: Avoid calling sleeping function by the health poll thread")
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
New channels are applied to the priv channels only after they
are successfully opened. Then, the indirection table should be built
according to the new number of channels.
Currently, such build is preformed independently of whether the
channels opening is successful, and is not reverted on failure.
The bug is caused due to removal of rss params from channels struct
and moving it to priv struct. That change cause to independency between
channels and rss params.
This causes a crash on a later point, when accessing rqn of a non
existing channel.
This patch fixes it by moving the indirection table build right before
switching the priv channels to new channels struct, after the new set of
channels was successfully opened.
Fixes: bbeb53b8b2 ("net/mlx5e: Move RSS params to a dedicated struct")
Signed-off-by: Maria Pasechnik <mariap@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
it processes the args but does not update the remaining length
of the buffer that the string arguments will be placed in. It
constantly passes in the total size of buffer used instead of
passing in the remaining size of the buffer used. This could cause
issues if the strings are larger than the max size of an event
which could cause the strings to be written beyond what was reserved
on the buffer.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXGN7BRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qoa1AQD7+6O0DncGwk5aWqRHESXKlmOWteW6
eMFbEw3KDcvs2gEAvNLB1i2yVH6Enn50M0KpmYJMbyZK/LVn2QsPZfU/LgQ=
=KMBZ
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"This fixes kprobes/uprobes dynamic processing of strings, where it
processes the args but does not update the remaining length of the
buffer that the string arguments will be placed in. It constantly
passes in the total size of buffer used instead of passing in the
remaining size of the buffer used.
This could cause issues if the strings are larger than the max size of
an event which could cause the strings to be written beyond what was
reserved on the buffer"
* tag 'trace-v5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: probeevent: Correctly update remaining space in dynamic area
Update newly introduced function to be aligned to netdev coding style.
Fixes: 46861e3e88 ("net/mlx5: Set ODP SRQ support in firmware")
Reported-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Don't warn or fail if it's missing.
v2: handle xgmi case more gracefully.
v3: handle older kernels properly
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Tested-by: James Zhu <James.Zhu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
In the middle of do_exit() there is there is a call
"ptrace_event(PTRACE_EVENT_EXIT, code);" That call places the process
in TACKED_TRACED aka "(TASK_WAKEKILL | __TASK_TRACED)" and waits for
for the debugger to release the task or SIGKILL to be delivered.
Skipping past dequeue_signal when we know a fatal signal has already
been delivered resulted in SIGKILL remaining pending and
TIF_SIGPENDING remaining set. This in turn caused the
scheduler to not sleep in PTACE_EVENT_EXIT as it figured
a fatal signal was pending. This also caused ptrace_freeze_traced
in ptrace_check_attach to fail because it left a per thread
SIGKILL pending which is what fatal_signal_pending tests for.
This difference in signal state caused strace to report
strace: Exit of unknown pid NNNNN ignored
Therefore update the signal handling state like dequeue_signal
would when removing a per thread SIGKILL, by removing SIGKILL
from the per thread signal mask and clearing TIF_SIGPENDING.
Acked-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Ivan Delalande <colona@arista.com>
Cc: stable@vger.kernel.org
Fixes: 35634ffa17 ("signal: Always notice exiting tasks")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Some protocols have other means to verify the payload integrity
(AH, ESP, SCTP) while others are incompatible with nf_ip(6)_checksum
implementation because checksum is either optional or might be
partial (UDPLITE, DCCP, GRE). Because nf_ip(6)_checksum was used
to validate the packets, ip(6)tables REJECT rules were not capable
to generate ICMP(v6) errors for the protocols mentioned above.
This commit also fixes the incorrect pseudo-header protocol used
for IPv4 packets that carry other transport protocols than TCP or
UDP (pseudo-header used protocol 0 iso the proper value).
Signed-off-by: Alin Nastac <alin.nastac@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Commit bb36489032 ("mmc: meson-gx: Free irq in release() callback")
changed the _probe code to use request_threaded_irq() instead of
devm_request_threaded_irq().
Unfortunately this removes a fallback for the interrupt name:
devm_request_threaded_irq() uses the device name as fallback if the
given IRQ name is NULL. request_threaded_irq() has no such fallback,
thus /proc/interrupts shows "(null)" instead.
Explicitly pass the dev_name() so we get the IRQ name shown in
/proc/interrupts again.
While here, also fix the indentation of the request_threaded_irq()
parameter list.
Fixes: bb36489032 ("mmc: meson-gx: Free irq in release() callback")
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
- Fix CSI register offsets for i.MX51 and i.MX53.
- Fix delayed page flip completion events on i.MX6QP due to unexpected
behaviour of the PRE when issuing NOP buffer updates to the same
buffer address.
- Stop throwing errors for plane updates on disabled CRTCs when a
userspace process is killed while a plane update is pending.
- Add missing of_node_put cleanup in imx_ldb_bind.
-----BEGIN PGP SIGNATURE-----
iI0EABYIADUWIQRRO6F6WdpH1R0vGibVhaclGDdiwAUCXGL3ZBcccC56YWJlbEBw
ZW5ndXRyb25peC5kZQAKCRDVhaclGDdiwDz3AP49Ldp9TuuS/bMHWcZrPXRjiWJO
9lRaWV3mKqMFsAQzswD9Gy0+eybyaPQqRvLXXxwuk7Jrh8or2TOyOqDVeoDC4Q8=
=zCye
-----END PGP SIGNATURE-----
Merge tag 'imx-drm-fixes-2019-02-12' of git://git.pengutronix.de/pza/linux into drm-fixes
drm/imx: plane, ldb, and ipu-v3 fixes
- Fix CSI register offsets for i.MX51 and i.MX53.
- Fix delayed page flip completion events on i.MX6QP due to unexpected
behaviour of the PRE when issuing NOP buffer updates to the same
buffer address.
- Stop throwing errors for plane updates on disabled CRTCs when a
userspace process is killed while a plane update is pending.
- Add missing of_node_put cleanup in imx_ldb_bind.
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Philipp Zabel <p.zabel@pengutronix.de>
Link: https://patchwork.freedesktop.org/patch/msgid/1549990602.4800.11.camel@pengutronix.de
Florian Fainelli says:
====================
Remove unused variables
This removes unused variables from mlxsw and ethsw after the recent
removal of SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS, build scripts are now
fixed to take care of those warnings :).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
After 1b8b589d91 ("staging: fsl-dpaa2: ethsw: Remove getting
PORT_BRIDGE_FLAGS") we are not accessing any driver private data
anymore, remove port_priv which is unused.
Fixes: 1b8b589d91 ("staging: fsl-dpaa2: ethsw: Remove getting PORT_BRIDGE_FLAGS")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After 1ecb195753 ("mlxsw: spectrum_switchdev: Remove getting
PORT_BRIDGE_FLAGS") we are not accessing any driver private data
structure, so the mlxsw_sp_port and mlxsw_sp variables are unused.
Fixes: 1ecb195753 ("mlxsw: spectrum_switchdev: Remove getting PORT_BRIDGE_FLAGS")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge fixes from Andrew Morton:
"6 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm: proc: smaps_rollup: fix pss_locked calculation
Rename include/{uapi => }/asm-generic/shmparam.h really
Revert "mm: use early_pfn_to_nid in page_ext_init"
mm/gup: fix gup_pmd_range() for dax
Revert "mm: slowly shrink slabs with a relatively small number of objects"
Revert "mm: don't reclaim inodes with many attached pages"
The 'pss_locked' field of smaps_rollup was being calculated incorrectly.
It accumulated the current pss everytime a locked VMA was found. Fix
that by adding to 'pss_locked' the same time as that of 'pss' if the vma
being walked is locked.
Link: http://lkml.kernel.org/r/20190203065425.14650-1-sspatil@android.com
Fixes: 493b0e9d94 ("mm: add /proc/pid/smaps_rollup")
Signed-off-by: Sandeep Patil <sspatil@android.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: <stable@vger.kernel.org> [4.14.x, 4.19.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit fe53ca5427 ("mm: use early_pfn_to_nid in
page_ext_init").
When booting a system with "page_owner=on",
start_kernel
page_ext_init
invoke_init_callbacks
init_section_page_ext
init_page_owner
init_early_allocated_pages
init_zones_in_node
init_pages_in_zone
lookup_page_ext
page_to_nid
The issue here is that page_to_nid() will not work since some page flags
have no node information until later in page_alloc_init_late() due to
DEFERRED_STRUCT_PAGE_INIT. Hence, it could trigger an out-of-bounds
access with an invalid nid.
UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50
index 7 is out of range for type 'zone [5]'
Also, kernel will panic since flags were poisoned earlier with,
CONFIG_DEBUG_VM_PGFLAGS=y
CONFIG_NODE_NOT_IN_PAGE_FLAGS=n
start_kernel
setup_arch
pagetable_init
paging_init
sparse_init
sparse_init_nid
memblock_alloc_try_nid_raw
It did not handle it well in init_pages_in_zone() which ends up calling
page_to_nid().
page:ffffea0004200000 is uninitialized and poisoned
raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
page_owner info is not active (free page?)
kernel BUG at include/linux/mm.h:990!
RIP: 0010:init_page_owner+0x486/0x520
This means that assumptions behind commit fe53ca5427 ("mm: use
early_pfn_to_nid in page_ext_init") are incomplete. Therefore, revert
the commit for now. A proper way to move the page_owner initialization
to sooner is to hook into memmap initialization.
Link: http://lkml.kernel.org/r/20190115202812.75820-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Michal Hocko <mhocko@kernel.org>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Yang Shi <yang.shi@linaro.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For dax pmd, pmd_trans_huge() returns false but pmd_huge() returns true
on x86. So the function works as long as hugetlb is configured.
However, dax doesn't depend on hugetlb.
Link: http://lkml.kernel.org/r/20190111034033.601-1-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: "Michael S . Tsirkin" <mst@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 172b06c32b ("mm: slowly shrink slabs with a
relatively small number of objects").
This change changes the agressiveness of shrinker reclaim, causing small
cache and low priority reclaim to greatly increase scanning pressure on
small caches. As a result, light memory pressure has a disproportionate
affect on small caches, and causes large caches to be reclaimed much
faster than previously.
As a result, it greatly perturbs the delicate balance of the VFS caches
(dentry/inode vs file page cache) such that the inode/dentry caches are
reclaimed much, much faster than the page cache and this drives us into
several other caching imbalance related problems.
As such, this is a bad change and needs to be reverted.
[ Needs some massaging to retain the later seekless shrinker
modifications.]
Link: http://lkml.kernel.org/r/20190130041707.27750-3-david@fromorbit.com
Fixes: 172b06c32b ("mm: slowly shrink slabs with a relatively small number of objects")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Cc: Wolfgang Walter <linux@stwm.de>
Cc: Roman Gushchin <guro@fb.com>
Cc: Spock <dairinin@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit a76cf1a474 ("mm: don't reclaim inodes with many
attached pages").
This change causes serious changes to page cache and inode cache
behaviour and balance, resulting in major performance regressions when
combining worklaods such as large file copies and kernel compiles.
https://bugzilla.kernel.org/show_bug.cgi?id=202441
This change is a hack to work around the problems introduced by changing
how agressive shrinkers are on small caches in commit 172b06c32b ("mm:
slowly shrink slabs with a relatively small number of objects"). It
creates more problems than it solves, wasn't adequately reviewed or
tested, so it needs to be reverted.
Link: http://lkml.kernel.org/r/20190130041707.27750-2-david@fromorbit.com
Fixes: a76cf1a474 ("mm: don't reclaim inodes with many attached pages")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Cc: Wolfgang Walter <linux@stwm.de>
Cc: Roman Gushchin <guro@fb.com>
Cc: Spock <dairinin@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sync_request_write no longer submits writes to a Faulty device. This has
the unfortunate side effect that bitmap bits can be incorrectly cleared
if a recovery is interrupted (previously, end_sync_write would have
prevented this). This means the next recovery may not copy everything
it should, potentially corrupting data.
Add a function for doing the proper md_bitmap_end_sync, called from
end_sync_write and the Faulty case in sync_request_write.
backport note to 4.14: s/md_bitmap_end_sync/bitmap_end_sync
Cc: stable@vger.kernel.org 4.14+
Fixes: 0c9d5b127f ("md/raid1: avoid reusing a resync bio after error handling.")
Reviewed-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Tested-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Nate Dailey <nate.dailey@stratus.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
After 610d2b601b ("rocker: Remove getting PORT_BRIDGE_FLAGS") we no
longer have a port_attr_bridge_flags_get member in the rocker_world_ops
structre, fix that.
Fixes: 610d2b601b ("rocker: Remove getting PORT_BRIDGE_FLAGS")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This tag contains a pair of bug fixes that I'd like to include in 5.0:
* A fix to disambiguate swap from invalid PTEs, which fixes an error
when trying to unmap PROT_NONE pages.
* A revert to an optimization of the size of flat binaries. This is
really a workaround to prevent breaking existing boot flows, but since
the change was introduced as part of the 5.0 merge window I'd like to
have the fix in before 5.0 so we can avoid a regression for any proper
releases.
With these I hope we're out of patches for 5.0 in RISC-V land.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEAM520YNJYN/OiG3470yhUCzLq0EFAlxjN2gTHHBhbG1lckBk
YWJiZWx0LmNvbQAKCRDvTKFQLMurQY73EACOoplTb4zBKXY4pCSZG/Gnu+AEJz77
CS1wR86TThkyxIJTlIGmX0xSFuds4f+ymCMbZ7SfXa20es62iZtEQA72TVb+g80L
hD/0yZFZQkcBEQOVOC2ZWvaj4tVZB703yHYZGSnN/6sHISS0yoO8fqf19SGSmDC3
aCNE5Fm6z5MNozS6Z5Lrtpw2rmpDwQCHnxfLPzEvjC9DPMwzHQbyP3tALWcfLpK7
aIPmq/xjUtS5eVb/3CHhLmkYFTNlOs4p0jsKjw2Sw/3NWHCXrPw6afrjZ533j8nR
BI/u/EErMJWl6qLasdzTTfhE16sLDzbowiGdOfSurjwRDkAokw9/zvLDLDzLXwoT
UQkCcp2mnOsCmdXaC8L0yUi3mHh2M82GL1RXq4POyCgD77DqVvAJ/ot/Lq3UFila
T2rfsxIFC2pHmdDlQe1zo31g6TeI4FJaHA4juNXce2bIbtaU4OxqiXfN/cKjyAX2
8hevxUwy/PY0aqzLl14Xg2tcdurVbFh8BZ7eWYwqWsTrVwRg0ZhnWG0FdopNMA9D
z5nBbE3KnVEDrMh7QzSgtNDzSmWRUKD0S5B0t4/lMv9uvYrtqeBXCKS3vJ1tc0zy
tzj+tyjmgsfkAIVfA/fQqznx7sXAdD/jCPkUtotQUP2YkKRFbVyLjVTXKQ3BiyAl
xrZhsMPr8BPj0Q==
=EjWP
-----END PGP SIGNATURE-----
Merge tag 'riscv-for-linus-5.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/palmer/riscv-linux
Pull RISC-V fixes from Palmer Dabbelt:
"This contains a pair of bug fixes that I'd like to include in 5.0:
- A fix to disambiguate swap from invalid PTEs, which fixes an error
when trying to unmap PROT_NONE pages.
- A revert to an optimization of the size of flat binaries. This is
really a workaround to prevent breaking existing boot flows, but
since the change was introduced as part of the 5.0 merge window I'd
like to have the fix in before 5.0 so we can avoid a regression for
any proper releases.
With these I hope we're out of patches for 5.0 in RISC-V land"
* tag 'riscv-for-linus-5.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/palmer/riscv-linux:
Revert "RISC-V: Make BSS section as the last section in vmlinux.lds.S"
riscv: Add pte bit to distinguish swap from invalid
The current opt_inst_list operations inside team_nl_cmd_options_set()
is too complex to track:
LIST_HEAD(opt_inst_list);
nla_for_each_nested(...) {
list_for_each_entry(opt_inst, &team->option_inst_list, list) {
if (__team_option_inst_tmp_find(&opt_inst_list, opt_inst))
continue;
list_add(&opt_inst->tmp_list, &opt_inst_list);
}
}
team_nl_send_event_options_get(team, &opt_inst_list);
as while we retrieve 'opt_inst' from team->option_inst_list, it could
be added to the local 'opt_inst_list' for multiple times. The
__team_option_inst_tmp_find() doesn't work, as the setter
team_mode_option_set() still calls team->ops.exit() which uses
->tmp_list too in __team_options_change_check().
Simplify the list operations by moving the 'opt_inst_list' and
team_nl_send_event_options_get() into the nla_for_each_nested() loop so
that it can be guranteed that we won't insert a same list entry for
multiple times. Therefore, __team_option_inst_tmp_find() can be removed
too.
Fixes: 4fb0534fb7 ("team: avoid adding twice the same option to the event list")
Fixes: 2fcdb2c9e6 ("team: allow to send multiple set events in one message")
Reported-by: syzbot+4d4af685432dc0e56c91@syzkaller.appspotmail.com
Reported-by: syzbot+68ee510075cf64260cc4@syzkaller.appspotmail.com
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang says:
====================
net_sched: some fixes for cls_tcindex
This patchset contains 3 bug fixes for tcindex filter. Please check
each patch for details.
v2: fix a compile error in patch 2
drop netns refcnt in patch 1
====================
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
struct tcindex_filter_result contains two parts:
struct tcf_exts and struct tcf_result.
For the local variable 'cr', its exts part is never used but
initialized without being released properly on success path. So
just completely remove the exts part to fix this leak.
For the local variable 'new_filter_result', it is never properly
released if not used by 'r' on success path.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When tcindex_destroy() destroys all the filter results in
the perfect hash table, it invokes the walker to delete
each of them. However, results with class==0 are skipped
in either tcindex_walk() or tcindex_delete(), which causes
a memory leak reported by kmemleak.
This patch fixes it by skipping the walker and directly
deleting these filter results so we don't miss any filter
result.
As a result of this change, we have to initialize exts->net
properly in tcindex_alloc_perfect_hash(). For net-next, we
need to consider whether we should initialize ->net in
tcf_exts_init() instead, before that just directly test
CONFIG_NET_CLS_ACT=y.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcindex_destroy() invokes tcindex_destroy_element() via
a walker to delete each filter result in its perfect hash
table, and tcindex_destroy_element() calls tcindex_delete()
which schedules tcf RCU works to do the final deletion work.
Unfortunately this races with the RCU callback
__tcindex_destroy(), which could lead to use-after-free as
reported by Adrian.
Fix this by migrating this RCU callback to tcf RCU work too,
as that workqueue is ordered, we will not have use-after-free.
Note, we don't need to hold netns refcnt because we don't call
tcf_exts_destroy() here.
Fixes: 27ce4f05e2 ("net_sched: use tcf_queue_work() in tcindex filter")
Reported-by: Adrian <bugs@abtelecom.ro>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arthur Kiyanovski says:
====================
net: ena: race condition bug fix and version update
This patchset includes a fix to a race condition that can cause
kernel panic, as well as a driver version update because of this
fix.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>