* A less nasty use-after-free that can only zero some words at
the beginning of the page, and hence is not really exploitable
* A NULL pointer dereference
* A dummy implementation of an AMD chicken bit MSR that Windows uses
for some unknown reason
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJcGWZUAAoJEL/70l94x66DhecIAKhyarwvQ1HyKUCmeeVAuQHi
8RW8iisiCAQdhtrzjrOjt025gYskeuKItw0V/ipg0orgTTgti4TRoJpfDpZV+550
2EHT2UyUNCjSUwwtUNJ60ky+GRyFCgY8kdqiMTqFmFnpbK/TWY7jx7jtPToRQron
tL1H0RCeYJlPThK/UM7i3UIvS5oVIZ8YOJ18PVcKCoiynrbPk7wgw3it5OdiO2bS
VnI5JMfNy4YIXPc6QqdcKmsLqtinHxVObg+0PnLFtaI6xUqzTzEa12toZxsc9Sf3
rHyYhpNO8kaTw2inLpvbgLX3TZbw5DckHEoOn+s4e5Q193HGTjsTfG9P9oph2hA=
=VQE8
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
- One nasty use-after-free bugfix, from this merge window however
- A less nasty use-after-free that can only zero some words at the
beginning of the page, and hence is not really exploitable
- A NULL pointer dereference
- A dummy implementation of an AMD chicken bit MSR that Windows uses
for some unknown reason
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm: x86: Add AMD's EX_CFG to the list of ignored MSRs
KVM: X86: Fix NULL deref in vcpu_scan_ioapic
KVM: Fix UAF in nested posted interrupt processing
KVM: fix unregistering coalesced mmio zone from wrong bus
Fix a regression in dma-direct that didn't take account the magic AMD
memory encryption mask in the DMA address.
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAlwaVAULHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYOY4A/9FYHCK486kbEA1df8DxLT6Hz0U6NU9fxgOBMb347K
uQdmdRB8oR3+aaWnohdr6/qT2oAEn3HM9db4L+eu4D28Fxjfv9Pn0QwjFnAUKGKS
90ZvmE8FxHOZydR0l3glHgSxuK+aKGs4Qv6Im5vDWkzwVfMqN0WDbBUfbNbG4nsh
spkdRnx32orstPpUkqbr/26iUAFk9TI3tvg7QDgL88cFArwCBW5ZTmznCnbFeGoz
TxptRFzDYslWZJYAgKWJ9VQ1sCRyZeMD6maoHsFZNL+XQiZ9/WzLN8GZm8/Tn7sP
Rcc8D0qkNqcZbUFku2n1r2Z9iHjB3+fI87zdaarSytB5mhHNkpES5boC2ExTjXTu
QsQt33zbrJXO+8F/571u3k5n8QkkazmYewtviGnNpGq2AofinSOU4BsKxQePI00l
gN3FjVpFyvgxzRdxp4kpa3UfRFOp61zWqVYZYzGrH8Au99t4ineD+iY1ztd0Z2mH
23zjAfnCjGF8VOBNi9vxQFeNLlZoWCCuShENePJ76Bm6lBNjPT1O2Tilc+pbBs2A
RJPoUgMCSfZ+RHyS3ohKM/LLJaX1wEQHYTXa6FyskvOWa9vD4bXJ4kNwc5LY7SyK
i+KAnEierBxvtRMzo5uu9DJXaJZx1Cw711/2IAPkmXUMDld5rwsrdEbgqDD2mlD8
6l0=
=EybB
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-4.20-4' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fix from Christoph Hellwig:
"Fix a regression in dma-direct that didn't take account the magic AMD
memory encryption mask in the DMA address"
* tag 'dma-mapping-4.20-4' of git://git.infradead.org/users/hch/dma-mapping:
dma-direct: do not include SME mask in the DMA supported check
When dumping proxy entries the dump request has NTF_PROXY set in
ndm_flags. strict mode checking needs to be updated to allow this
flag.
Fixes: 51183d233b ("net/neighbor: Update neigh_dump_info for strict data checking")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
pneigh_lookup uses kmalloc versus kzalloc when new entries are allocated.
Given that the newly added protocol field needs to be initialized.
Fixes: df9b0e30d4 ("neighbor: Add protocol attribute")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch refactors reuseport_add_any selftest a bit:
- makes it more modular (eliminates several copy/pasted blocks);
- skips DCCP tests if DCCP is not supported
V2: added "Signed-off-by" tag.
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 6320 family of switches uses the same watchdog registers as the
6390.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The mvpp2_phylink_validate() sets all modes that are supported by a
given PPv2 port. An mistake made the 10000baseT_Full mode being
advertised in some cases when a port wasn't configured to perform at
10G. This patch fixes this.
Fixes: d97c9f4ab0 ("net: mvpp2: 1000baseX support")
Reported-by: Russell King <linux@armlinux.org.uk>
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When replacing a rule we add the new rule to the rhashtable
but only remove the old if not in skip_sw.
This commit fix this and remove the old rule anyway.
Fixes: 35cc3cefc4 ("net/sched: cls_flower: Reject duplicated rules also under skip_sw")
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Biao Huang says:
====================
add ethernet binding and modify ethernet driver for mt2712
changes in v3:
resend this series base on the latest net-next tree.
changes in v2 as comments from Sean:
1. fix typo.
2. use capital letters for RMII/MII/RGMII in driver and bindings.
v1:
This new series is the result of discussion in:
http://lkml.org/lkml/2018/12/13/1007http://lkml.org/lkml/2018/12/14/53
1. ethernet binding file move to this series.
2. remove fine tune property in device tree
3. remove fine tune flow in ethernet driver
4. set rgmii timing according to the value in device tree,
and don't care whether phy insert internal delay or not.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
1. remove fine-tune property and related setting to simplify
the timing adjustment flow.
2. set timing value according to the value from device tree,
and will not care whether PHY insert internal delay.
Signed-off-by: Biao Huang <biao.huang@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
remove fine-tune property in device tree, modify
the corresponding description in dt-binding.
Signed-off-by: Biao Huang <biao.huang@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip l add dev tun type gretap external
ip r a 10.0.0.1 encap ip dst 192.168.152.171 id 1000 dev gretap
For gretap Key example when the command set the id but don't set the
TUNNEL_KEY flags. There is no key field in the send packet
In the lwtunnel situation, some TUNNEL_FLAGS should can be set by
userspace
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
The MAC Reset was noticed to erase important EEPROM settings.
It is also unnecessary since a chip wide reset was done earlier
in initialization, and that reset preserves EEPROM settings.
There for this patch removes the unnecessary MAC specific reset.
Signed-off-by: Bryan Whitehead <Bryan.Whitehead@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some fixes for the mlx5 driver
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJcGrikAAoJEEg/ir3gV/o+tSEH/A6192OXs+B6phGdMr6GyhbK
nzw6bkJfjHLMDj00F+9qLYYQbqfG1QjDYZnthPCezD9Kv0thFuP8cvSSY1S6SsJq
uMwlHwYQ4ljZvl49pkwiiKMufWGcuVYf+NiEmvrYVWXugej1wENe1xiTS9V3OCAA
j9gQA7gYeEwveOLWLakwzbfvLJ+fyUkLDjB1EJPOzh2fnfkGl8Z8LovAmoXOywdn
PdMh9xAlqovJyY9kS4vaj/1WDCO9+rUfGHXDLjjzG5ygsfipyb8xSdzsY9WgriMu
NKgg1VpEjc9egalT9qqznUE7ilfBUvW++Yc0wsrH+Cme3Rj322L3Es+j5fduKGo=
=YiD1
-----END PGP SIGNATURE-----
Merge tag 'mlx5-fixes-2018-12-19' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-fixes-2018-12-19
Some fixes for the mlx5 driver
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Roopa Prabhu says:
====================
neigh get support
This series adds support for neigh get similar
to route and recently added fdb get.
v2: fix key len check. and some other fixes
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
this patch registers neigh doit handler. The doit handler
returns a neigh entry given dst and dev. This is similar
to route and fdb doit (get) handlers. Also moves nda_policy
declaration from rtnetlink.c to neighbour.c
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Expression terminated with "," instead of ";", resulted in
set_fte getting bad value for modify_enable_mask field.
Fixes: bd5251dbf1 ("net/mlx5_core: Introduce flow steering destination of type counter")
Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When the completion queue of the RQ is empty, do not immediately return.
If left-over decompressed CQEs (from the previous cycle) were processed,
need to go to the finalization part of the poll function.
Bug exists only when CQE compression is turned ON.
This solves the following issue:
mlx5_core 0000:82:00.1: mlx5_eq_int:544:(pid 0): CQ error on CQN 0xc08, syndrome 0x1
mlx5_core 0000:82:00.1 p4p2: mlx5e_cq_error_event: cqn=0x000c08 event=0x04
Fixes: 4b7dfc9925 ("net/mlx5e: Early-return on empty completion queues")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Ido Schimmel says:
====================
mlxsw: Make driver more robust
In recent months we fixed several bugs in the driver that could have
been avoided by re-evaluating some of the involved code paths and by
introducing relevant and comprehensive test cases.
This patchset tries to do that by introducing a set of small and mostly
non-functional changes in addition to a new test. I have further
improvements in mind, but they can be done in a different set.
Patch #1 makes sure we correctly sanitize upper devices of a VLAN
interface.
Patch #2 removes an unexpected behavior from the driver, in which routes
configured on a VLAN interface will cease being offloaded after certain
operations.
Patch #3 is a small cleanup.
Patch #4 simplifies the driver by removing reference counting from VLAN
entries configured on a port.
Patches #5-#6 simplify linking/unlinking from a bridge, especially when
LAG and VLAN devices are involved. They make both operations symmetric
even when ports are unlinked from a bridged LAG device.
Patch #7-#9 make router interface (RIF) deletion more robust by removing
reliance on device chain to indicate whether a NETDEV_DOWN event in the
inet{,6}addr notification chains should be processed. This is due to the
fact that IP addresses can be flushed from a netdev after it was
unlinked from its lower device.
Patch #10 adds a new test to for valid and invalid configurations over
mlxsw ports. Some of the test cases are derived from recent fixes. I
expect that more test cases will be added over time.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new test that is focused on rtnetlink configuration. Its purpose
is to test valid and invalid (as deemed by mlxsw) configurations and
make sure that they succeed / fail without producing a trace.
Some of the test cases are derived from recent fixes in order to make
sure that the fixed bugs are not introduced again.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previous patches tried to make RIF deletion more robust and avoid
use-after-free situations.
As another precaution, hold a reference on a RIF's netdev and release it
when the RIF is deleted.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the past we had multiple instances where RIFs were not properly
deleted.
One of the reasons for leaking a RIF was that at the time when IP
addresses were flushed from the respective netdev (prompting the
destruction of the RIF), the netdev was no longer a mlxsw upper. This
caused the inet{,6}addr notification blocks to ignore the NETDEV_DOWN
event and leak the RIF.
Instead of checking whether the netdev is our upper when an IP address
is removed, we can instead check if the netdev has a RIF configured.
To look up a RIF we need to access mlxsw private data, so the patch
stores the notification blocks inside a mlxsw struct. This then allows
us to use container_of() and extract the required private data.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Next patch is going to make RIF deletion more robust by removing
reliance on fragile mlxsw_sp_lower_get(). This is because a netdev is
not necessarily our upper anymore when its IP addresses are flushed.
The inet{,6}addr notification blocks are going to resolve 'struct
mlxsw_sp' using container_of(), but the functions they call still use
mlxsw_sp_lower_get().
As a preparation for the next patch, propagate 'struct mlxsw_sp' down to
the functions called from the notification blocks and remove reliance on
mlxsw_sp_lower_get().
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a LAG device or a VLAN device on top of it is enslaved to a bridge,
the driver propagates the CHANGEUPPER event to the LAG's slaves.
This causes each physical port to increase the reference count of the
internal representation of the bridge port by calling
mlxsw_sp_port_bridge_join().
However, when a port is removed from a LAG, the corresponding leave()
function is not called and the reference count is not decremented. This
leads to ugly hacks such as mlxsw_sp_bridge_port_should_destroy() that
try to understand if the bridge port should be destroyed even when its
reference count is not 0.
Instead, make sure that when a port is unlinked from a LAG it would see
the same events as if the LAG (or its uppers) were unlinked from a
bridge.
The above is achieved by walking the LAG's uppers when a port is
unlinked and calling mlxsw_sp_port_bridge_leave() for each upper that is
enslaved to a bridge.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit b3529af6bb ("spectrum: Reference count VLAN entries") started
reference counting port-VLAN entries in a similar fashion to the 8021q
driver.
However, this is not actually needed and only complicates things.
Instead, the driver should forbid the creation of a VLAN on a port if
this VLAN already exists. This would also solve the issue fixed by the
mentioned commit.
Therefore, remove the get()/put() API and use create()/destroy()
instead.
One place that needs special attention is VLAN addition in a VLAN-aware
bridge via switchdev operations. In case the VLAN flags (e.g., 'pvid')
are toggled, then the VLAN entry already exists. To prevent the driver
from wrongly returning EEXIST, the driver is changed to check in the
prepare phase whether the entry already exists and only returns an error
in case it is not associated with the correct bridge port.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 993107fea5 ("mlxsw: spectrum_switchdev: Fix VLAN device
deletion via ioctl") I fixed a bug caused by the fact that the driver
views differently the deletion of a VLAN device when it is deleted via
an ioctl and netlink.
Instead of relying on a specific order of events (device being
unregistered vs. VLAN filter being updated), simply make sure that the
driver performs the necessary cleanup when the VLAN device is unlinked,
which always happens before the other two events.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This function is no longer used. Remove it.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, when a RIF is constructed on top of a FID, the RIF increments
the FID's reference count and the RIF is destroyed when the FID's
reference count drops to 1. This effectively means that when no local
ports are member in the FID, the FID is destroyed regardless if the
router port is a member in the FID or not.
The above can lead to the unexpected behavior in which routes using a
VLAN interface as their nexthop device are no longer offloaded after the
last local port leaves the corresponding VLAN (FID).
Example:
# ip -4 route show dev br0.10
192.0.2.0/24 proto kernel scope link src 192.0.2.1 offload
# bridge vlan del vid 10 dev swp3
# ip -4 route show dev br0.10
192.0.2.0/24 proto kernel scope link src 192.0.2.1
After the patch, the route is offloaded before and after the VLAN is
removed from local port 'swp3', as the RIF corresponding to 'br0.10'
continues to exists.
In order to remove RIFs' reliance on the underlying FID's reference
count, we need to add a reference count to sub-port RIFs, which are RIFs
that correspond to physical ports and their uppers (e.g., LAG devices).
In this case, each {Port, VID} ('struct mlxsw_sp_port_vlan') needs to
hold a reference on the RIF. For example:
bond0.10
|
bond0
|
+-------+
| |
swp1 swp2
Both {Port 1, VID 10} and {Port 2, VID 10} will hold a reference on the
RIF corresponding to 'bond0.10'. When the last reference is dropped, the
RIF will be destroyed.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, only VRF and macvlan uppers are supported on top of VLAN
device configured over a bridge, so make sure the driver forbids other
uppers.
Note that enslavement to a VRF is handled earlier in the notification
block, so there is no need to check for a VRF upper here.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzbot reported the use of uninitialized udp6_addr::sin6_scope_id.
We can just set ::sin6_scope_id to zero, as tunnels are unlikely
to use an IPv6 address that needs a scope id and there is no
interface to bind in this context.
For net-next, it looks different as we have cfg->bind_ifindex there
so we can probably call ipv6_iface_scope_id().
Same for ::sin6_flowinfo, tunnels don't use it.
Fixes: 8024e02879 ("udp: Add udp_sock_create for UDP tunnels to open listener socket")
Reported-by: syzbot+c56449ed3652e6720f30@syzkaller.appspotmail.com
Cc: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When sending broadcast message on high load system, there are a lot of
unnecessary packets restranmission. That issue was caused by missing in
initial criteria for retransmission.
To prevent this happen, just initialize this criteria for retransmission
in next 10 milliseconds.
Fixes: 31c4f4cc32 ("tipc: improve broadcast retransmission algorithm")
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tuong Lien says:
====================
tipc: tracepoints and trace_events in TIPC
The patch series is the first step of introducing a tracing framework in
TIPC, which will assist in collecting complete & plentiful data for post
analysis, even in the case of a single failure occurrence e.g. when the
failure is unreproducible.
The tracing code in TIPC utilizes the powerful kernel tracepoints, trace
events features along with particular dump functions to trace the TIPC
object data and events (incl. bearer, link, socket, node, etc.).
The tracing code should generate zero-load to TIPC when the trace events
are not enabled.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit adds the new trace_event for TIPC bearer, L2 device event:
trace_tipc_l2_device_event()
Also, it puts the trace at the tipc_l2_device_event() function, then
the device/bearer events and related info can be traced out during
runtime when needed.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit adds the new trace_events for TIPC node object:
trace_tipc_node_create()
trace_tipc_node_delete()
trace_tipc_node_lost_contact()
trace_tipc_node_timeout()
trace_tipc_node_link_up()
trace_tipc_node_link_down()
trace_tipc_node_reset_links()
trace_tipc_node_fsm_evt()
trace_tipc_node_check_state()
Also, enables the traces for the following cases:
- When a node is created/deleted;
- When a node contact is lost;
- When a node timer is timed out;
- When a node link is up/down;
- When all node links are reset;
- When node state is changed;
- When a skb comes and node state needs to be checked/updated.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit adds the new trace_events for TIPC socket object:
trace_tipc_sk_create()
trace_tipc_sk_poll()
trace_tipc_sk_sendmsg()
trace_tipc_sk_sendmcast()
trace_tipc_sk_sendstream()
trace_tipc_sk_filter_rcv()
trace_tipc_sk_advance_rx()
trace_tipc_sk_rej_msg()
trace_tipc_sk_drop_msg()
trace_tipc_sk_release()
trace_tipc_sk_shutdown()
trace_tipc_sk_overlimit1()
trace_tipc_sk_overlimit2()
Also, enables the traces for the following cases:
- When user creates a TIPC socket;
- When user calls poll() on TIPC socket;
- When user sends a dgram/mcast/stream message.
- When a message is put into the socket 'sk_receive_queue';
- When a message is released from the socket 'sk_receive_queue';
- When a message is rejected (e.g. due to no port, invalid, etc.);
- When a message is dropped (e.g. due to wrong message type);
- When socket is released;
- When socket is shutdown;
- When socket rcvq's allocation is overlimit (> 90%);
- When socket rcvq + bklq's allocation is overlimit (> 90%);
- When the 'TIPC_ERR_OVERLOAD/2' issue happens;
Note:
a) All the socket traces are designed to be able to trace on a specific
socket by either using the 'event filtering' feature on a known socket
'portid' value or the sysctl file:
/proc/sys/net/tipc/sk_filter
The file determines a 'tuple' for what socket should be traced:
(portid, sock type, name type, name lower, name upper)
where:
+ 'portid' is the socket portid generated at socket creating, can be
found in the trace outputs or the 'tipc socket list' command printouts;
+ 'sock type' is the socket type (1 = SOCK_TREAM, ...);
+ 'name type', 'name lower' and 'name upper' are the service name being
connected to or published by the socket.
Value '0' means 'ANY', the default tuple value is (0, 0, 0, 0, 0) i.e.
the traces happen for every sockets with no filter.
b) The 'tipc_sk_overlimit1/2' event is also a conditional trace_event
which happens when the socket receive queue (and backlog queue) is
about to be overloaded, when the queue allocation is > 90%. Then, when
the trace is enabled, the last skbs leading to the TIPC_ERR_OVERLOAD/2
issue can be traced.
The trace event is designed as an 'upper watermark' notification that
the other traces (e.g. 'tipc_sk_advance_rx' vs 'tipc_sk_filter_rcv') or
actions can be triggerred in the meanwhile to see what is going on with
the socket queue.
In addition, the 'trace_tipc_sk_dump()' is also placed at the
'TIPC_ERR_OVERLOAD/2' case, so the socket and last skb can be dumped
for post-analysis.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit adds the new trace_events for TIPC link object:
trace_tipc_link_timeout()
trace_tipc_link_fsm()
trace_tipc_link_reset()
trace_tipc_link_too_silent()
trace_tipc_link_retrans()
trace_tipc_link_bc_ack()
trace_tipc_link_conges()
And the traces for PROTOCOL messages at building and receiving:
trace_tipc_proto_build()
trace_tipc_proto_rcv()
Note:
a) The 'tipc_link_too_silent' event will only happen when the
'silent_intv_cnt' is about to reach the 'abort_limit' value (and the
event is enabled). The benefit for this kind of event is that we can
get an early indication about TIPC link loss issue due to timeout, then
can do some necessary actions for troubleshooting.
For example: To trigger the 'tipc_proto_rcv' when the 'too_silent'
event occurs:
echo 'enable_event:tipc:tipc_proto_rcv' > \
events/tipc/tipc_link_too_silent/trigger
And disable it when TIPC link is reset:
echo 'disable_event:tipc:tipc_proto_rcv' > \
events/tipc/tipc_link_reset/trigger
b) The 'tipc_link_retrans' or 'tipc_link_bc_ack' event is useful to
trace TIPC retransmission issues.
In addition, the commit adds the 'trace_tipc_list/link_dump()' at the
'retransmission failure' case. Then, if the issue occurs, the link
'transmq' along with the link data can be dumped for post-analysis.
These dump events should be enabled by default since it will only take
effect when the failure happens.
The same approach is also applied for the faulty case that the
validation of protocol message is failed.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
As for the sake of debugging/tracing, the commit enables tracepoints in
TIPC along with some general trace_events as shown below. It also
defines some 'tipc_*_dump()' functions that allow to dump TIPC object
data whenever needed, that is, for general debug purposes, ie. not just
for the trace_events.
The following trace_events are now available:
- trace_tipc_skb_dump(): allows to trace and dump TIPC msg & skb data,
e.g. message type, user, droppable, skb truesize, cloned skb, etc.
- trace_tipc_list_dump(): allows to trace and dump any TIPC buffers or
queues, e.g. TIPC link transmq, socket receive queue, etc.
- trace_tipc_sk_dump(): allows to trace and dump TIPC socket data, e.g.
sk state, sk type, connection type, rmem_alloc, socket queues, etc.
- trace_tipc_link_dump(): allows to trace and dump TIPC link data, e.g.
link state, silent_intv_cnt, gap, bc_gap, link queues, etc.
- trace_tipc_node_dump(): allows to trace and dump TIPC node data, e.g.
node state, active links, capabilities, link entries, etc.
How to use:
Put the trace functions at any places where we want to dump TIPC data
or events.
Note:
a) The dump functions will generate raw data only, that is, to offload
the trace event's processing, it can require a tool or script to parse
the data but this should be simple.
b) The trace_tipc_*_dump() should be reserved for a failure cases only
(e.g. the retransmission failure case) or where we do not expect to
happen too often, then we can consider enabling these events by default
since they will almost not take any effects under normal conditions,
but once the rare condition or failure occurs, we get the dumped data
fully for post-analysis.
For other trace purposes, we can reuse these trace classes as template
but different events.
c) A trace_event is only effective when we enable it. To enable the
TIPC trace_events, echo 1 to 'enable' files in the events/tipc/
directory in the 'debugfs' file system. Normally, they are located at:
/sys/kernel/debug/tracing/events/tipc/
For example:
To enable the tipc_link_dump event:
echo 1 > /sys/kernel/debug/tracing/events/tipc/tipc_link_dump/enable
To enable all the TIPC trace_events:
echo 1 > /sys/kernel/debug/tracing/events/tipc/enable
To collect the trace data:
cat trace
or
cat trace_pipe > /trace.out &
To disable all the TIPC trace_events:
echo 0 > /sys/kernel/debug/tracing/events/tipc/enable
To clear the trace buffer:
echo > trace
d) Like the other trace_events, the feature like 'filter' or 'trigger'
is also usable for the tipc trace_events.
For more details, have a look at:
Documentation/trace/ftrace.txt
MAINTAINERS | add two new files 'trace.h' & 'trace.c' in tipc
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current code has 2 problems. It assumes that the RX ring for
the loopback packet is combined with the TX ring. This is not
true if the ethtool channels are set to non-combined mode. The
second problem is that it won't work on 57500 chips without
adjusting the logic to get the proper completion ring (cpr) pointer.
Fix both issues by locating the proper cpr pointer through the RX
ring.
Fixes: e44758b78a ("bnxt_en: Use bnxt_cp_ring_info struct pointer as parameter for RX path.")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal says:
====================
sk_buff: add extension infrastructure
TL;DR:
- objdiff shows no change if CONFIG_XFRM=n && BR_NETFILTER=n
- small size reduction when one or both options are set
- no changes in ipsec performance
Changes since v1:
- Allocate entire extension space from a kmem_cache.
- Avoid atomic_dec_and_test operation on skb_ext_put() for refcnt == 1 case.
(similar to kfree_skbmem() fclone_ref use).
This adds an optional extension infrastructure, with ispec (xfrm) and
bridge netfilter as first users.
The third (future) user is Multipath TCP which is still out-of-tree.
MPTCP needs to map logical mptcp sequence numbers to the tcp sequence
numbers used by individual subflows.
This DSS mapping is read/written from tcp option space on receive and
written to tcp option space on transmitted tcp packets that are part of
and MPTCP connection.
Extending skb_shared_info or adding a private data field to skb fclones
doesn't work for incoming skb, so a different DSS propagation method would
be required for the receive side.
mptcp has same requirements as secpath/bridge netfilter:
1. extension memory is released when the sk_buff is free'd.
2. data is shared after cloning an skb (clone inherits extension)
3. adding extension to an skb will COW the extension buffer if needed.
Two new members are added to sk_buff:
1. 'active_extensions' byte (filling a hole), telling which extensions
are available for this skb.
This has two purposes.
a) avoids the need to initialize the pointer.
b) allows to "delete" an extension by clearing its bit
value in ->active_extensions.
While it would be possible to store the active_extensions byte
in the extension struct instead of sk_buff, there is one problem
with this:
When an extension has to be disabled, we can always clear the
bit in skb->active_extensions. But in case it would be stored in the
extension buffer itself, we might have to COW it first, if
we are dealing with a cloned skb. On kmalloc failure we would
be unable to turn an extension off.
2. extension pointer, located at the end of the sk_buff.
If the active_extensions byte is 0, the pointer is undefined,
it is not initialized on skb allocation.
This adds extra code to skb clone and free paths (to deal with
refcount/free of extension area) but this replaces similar code that
manages skb->nf_bridge and skb->sp structs in the followup patches of
the series.
It is possible to add support for extensions that are not preseved on
clones/copies:
1. define a bitmask of all extensions that need copy/cow on clone
2. change __skb_ext_copy() to check
->active_extensions & SKB_EXT_PRESERVE_ON_CLONE
3. set clone->active_extensions to 0 if test is false.
This isn't done here because all extensions that get added here
need the copy/cow semantics.
Last patch converts skb->sp, secpath information gets stored as
new SKB_EXT_SEC_PATH, so the 'sp' pointer is removed from skbuff.
Extra code added to skb clone and free paths (to deal with refcount/free
of extension area) replaces the existing code that does the same for
skb->nf_bridge and skb->secpath.
I don't see any other in-tree users that could benefit from this
infrastructure, it doesn't make sense to add an extension just for the sake
of a single flag bit (like skb->nf_trace).
Adding a new extension is a good fit if all of the following are true:
1. Data is related to the skb/packet aggregate
2. Data should be freed when the skb is free'd
3. Data is not going to be relevant/needed in normal case (udp, tcp,
forwarding workloads, ...)
4. There are no fancy action(s) needed on clone/free, such as callbacks
into kernel modules.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove skb->sp and allocate secpath storage via extension
infrastructure. This also reduces sk_buff by 8 bytes on x86_64.
Total size of allyesconfig kernel is reduced slightly, as there is
less inlined code (one conditional atomic op instead of two on
skb_clone).
No differences in throughput in following ipsec performance tests:
- transport mode with aes on 10GB link
- tunnel mode between two network namespaces with aes and null cipher
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
secpath_set is a wrapper for secpath_dup that will not perform
an allocation if the secpath attached to the skb has a reference count
of one, i.e., it doesn't need to be COW'ed.
Also, secpath_dup doesn't attach the secpath to the skb, it leaves
this to the caller.
Use secpath_set in places that immediately assign the return value to
skb.
This allows to remove skb->sp without touching these spots again.
secpath_dup can eventually be removed in followup patch.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
reduce noise when skb->sp is removed later in the series.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Will reduce noise when skb->sp is removed later in this series.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
... so this won't have to be changed when skb->sp goes away.
v2: no changes, preserve ack.
Acked-by: Shannon Nelson <shannon.lee.nelson@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Will avoid touching this when sp pointer is removed from sk_buff struct.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>