A node that has the MRA role, it can behave as MRM or MRC.
Initially it starts as MRM and sends MRP_Test frames on both ring ports.
If it detects that there are MRP_Test send by another MRM, then it
checks if these frames have a lower priority than itself. In this case
it would send MRP_Nack frames to notify the other node that it needs to
stop sending MRP_Test frames.
If it receives a MRP_Nack frame then it stops sending MRP_Test frames
and starts to behave as a MRC but it would continue to monitor the
MRP_Test frames send by MRM. If at a point the MRM stops to send
MRP_Test frames it would get the MRM role and start to send MRP_Test
frames.
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Each MRP instance has a priority, a lower value means a higher priority.
The priority of MRP instance is stored in MRP_Test frame in this way
all the MRP nodes in the ring can see other nodes priority.
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace u16/u32 with be16/be32 in the MRP frame types.
This fixes sparse warnings like:
warning: cast to restricted __be16
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This type is used for traps that trap control packets such as ARP
request and IGMP query to the CPU.
Do not report such packets to the kernel's drop monitor as they were not
dropped by the device no encountered an exception during forwarding.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The action is used by control traps such as IGMP query. The packet is
flooded by the device, but also trapped to the CPU in order for the
software bridge to mark the receiving port as a multicast router port.
Such packets are marked with 'skb->offload_fwd_mark = 1' in order to
prevent the software bridge from flooding them again.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next
to extend ctnetlink and the flowtable infrastructure:
1) Extend ctnetlink kernel side netlink dump filtering capabilities,
from Romain Bellan.
2) Generalise the flowtable hook parser to take a hook list.
3) Pass a hook list to the flowtable hook registration/unregistration.
4) Add a helper function to release the flowtable hook list.
5) Update the flowtable event notifier to pass a flowtable hook list.
6) Allow users to add new devices to an existing flowtables.
7) Allow users to remove devices to an existing flowtables.
8) Allow for registering a flowtable with no initial devices.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
xdp_umem.c had overlapping changes between the 64-bit math fix
for the calculation of npgs and the removal of the zerocopy
memory type which got rid of the chunk_size_nohdr member.
The mlx5 Kconfig conflict is a case where we just take the
net-next copy of the Kconfig entry dependency as it takes on
the ESWITCH dependency by one level of indirection which is
what the 'net' conflicting change is trying to ensure.
Signed-off-by: David S. Miller <davem@davemloft.net>
* many 6 GHz changes, though it's not _quite_ complete
(I left out scanning for now, we're still discussing)
* allow userspace SA-query processing for operating channel
validation
* TX status for control port TX, for AP-side operation
* more per-STA/TID control options
* move to kHz for channels, for future S1G operation
* various other small changes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl7TfecACgkQB8qZga/f
l8Tx6hAAgRizfdHb9xxp001AAzKnsdU46srOKOhwV2d6w+S+qHbLtwa0Xz43pBvX
LxpQs7dBQBLYh11xJhDlKY6duYV989xGcsHm7suO43jbjDo8KXfz4MaP65em6EKt
pdD0mD1sKkfR4FhYNbUEe8Ug/185jdk+gX+aI1Nrz6XlkUoiY+czSnGFyAvpvau2
I+NGqyKG5D6ureq7p7dQcgN+t2D4Ou9stVhpQ+jP0Ep720gvfTEzeFuMJbb3JZ1y
KSgOOWS1HQj1FdlJDs3KAmgUXpkU/lxZhNxl06MMYo3tB7Y0vmLoy/ZNcb5eW4Sw
a0SHgG5yhDysCyINz6q7llG3esDcppGiNuMjd/qR2qPOZPHNtlYaHtcoKBcKdS0k
03DyURZpA0B33cr9FTV8tXaM7IMY/2qaq/DqkeNtuDzGdh4jEwkVJ4fNtUAdgcOv
4JEz3A7fY3isy8tzi7Dom4U/2hR1di5gZloAC5PPYRvnbmY9HoIqG06k1Wtn1Yj4
pbquqvdJ5ONcaAaXz7zVQUZm1JzrK81Pl3pdih7USasc8z2MEzWQPSR+hxtwG5TY
KbDI1Nel8ZLbL2MWDakh3+lPoJAMuyadRlVVWEMj4l/afYHgcy5hEbaMbaZnxmAg
G4I6R5JZTJZuVdKi/U/Q9n7jR83qfIRNbxMLY8HFZ4caJ5qhZGs=
=wdaG
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2020-05-31' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
Another set of changes, including
* many 6 GHz changes, though it's not _quite_ complete
(I left out scanning for now, we're still discussing)
* allow userspace SA-query processing for operating channel
validation
* TX status for control port TX, for AP-side operation
* more per-STA/TID control options
* move to kHz for channels, for future S1G operation
* various other small changes
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
With some newer AKMs, the KCK and KEK are bigger, so allow that
if the driver advertises support for it. In addition, add a new
attribute for the AKM so we can use it for offloaded rekeying.
Signed-off-by: Nathan Errera <nathan.errera@intel.com>
[reword commit message]
Link: https://lore.kernel.org/r/20200528212237.5eb58b00a5d1.I61b09d77c4f382e8d58a05dcca78096e99a6bc15@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
These capabilities cover what would otherwise be transported
in HT/VHT capabilities, but only a subset thereof that is
actually needed on 6 GHz with HE already present. Expose the
capabilities to userspace, drivers are expected to set them
as using the 6 GHz band (currently) requires HE capability.
Link: https://lore.kernel.org/r/20200528213443.244cd5cb9db8.Icd8c773277a88c837e7e3af1d4d1013cc3b66543@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Steffen Klassert says:
====================
pull request (net): ipsec 2020-05-29
1) Several fixes for ESP gro/gso in transport and beet mode when
IPv6 extension headers are present. From Xin Long.
2) Fix a wrong comment on XFRMA_OFFLOAD_DEV.
From Antony Antony.
3) Fix sk_destruct callback handling on ESP in TCP encapsulation.
From Sabrina Dubroca.
4) Fix a use after free in xfrm_output_gso when used with vxlan.
From Xin Long.
5) Fix secpath handling of VTI when used wiuth IPCOMP.
From Xin Long.
6) Fix an oops when deleting a x-netns xfrm interface.
From Nicolas Dichtel.
7) Fix a possible warning on policy updates. We had a case where it was
possible to add two policies with the same lookup keys.
From Xin Long.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Conntrack dump does not support kernel side filtering (only get exists,
but it returns only one entry. And user has to give a full valid tuple)
It means that userspace has to implement filtering after receiving many
irrelevant entries, consuming resources (conntrack table is sometimes
very huge, much more than a routing table for example).
This patch adds filtering in kernel side. To achieve this goal, we:
* Add a new CTA_FILTER netlink attributes, actually a flag list to
parametize filtering
* Convert some *nlattr_to_tuple() functions, to allow a partial parsing
of CTA_TUPLE_ORIG and CTA_TUPLE_REPLY (so nf_conntrack_tuple it not
fully set)
Filtering is now possible on:
* IP SRC/DST values
* Ports for TCP and UDP flows
* IMCP(v6) codes types and IDs
Filtering is done as an "AND" operator. For example, when flags
PROTO_SRC_PORT, PROTO_NUM and IP_SRC are sets, only entries matching all
values are dumped.
Changes since v1:
Set NLM_F_DUMP_FILTERED in nlm flags if entries are filtered
Changes since v2:
Move several constants to nf_internals.h
Move a fix on netlink values check in a separate patch
Add a check on not-supported flags
Return EOPNOTSUPP if CDA_FILTER is set in ctnetlink_flush_conntrack
(not yet implemented)
Code style issues
Changes since v3:
Fix compilation warning reported by kbuild test robot
Changes since v4:
Fix a regression introduced in v3 (returned EINVAL for valid netlink
messages without CTA_MARK)
Changes since v5:
Change definition of CTA_FILTER_F_ALL
Fix a regression when CTA_TUPLE_ZONE is not set
Signed-off-by: Romain Bellan <romain.bellan@wifirst.fr>
Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch reworks the MRP netlink interface. Before, each attribute
represented a binary structure which made it hard to be extended.
Therefore update the MRP netlink interface such that each existing
attribute to be a nested attribute which contains the fields of the
binary structures.
In this way the MRP netlink interface can be extended without breaking
the backwards compatibility. It is also using strict checking for
attributes under the MRP top attribute.
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When compiling inside the kernel include linux/stddef.h instead of
stddef.h. When I compile this header file in backports for power PC I
run into a conflict with ptrdiff_t. I was unable to reproduce this in
mainline kernel. I still would like to fix this problem in the kernel.
Fixes: 6989310f5d ("wireless: Use offsetof instead of custom macro.")
Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
Link: https://lore.kernel.org/r/20200521201422.16493-1-hauke@hauke-m.de
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This patch adds support to configure per TID Tx Rate configuration
through NL80211_TID_CONFIG_ATTR_TX_RATE* attributes. And it uses
nl80211_parse_tx_bitrate_mask api to validate the Tx rate mask.
Signed-off-by: Tamizh Chelvam <tamizhr@codeaurora.org>
Link: https://lore.kernel.org/r/1589357504-10175-1-git-send-email-tamizhr@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This adds the necessary capabilities in nl80211 to allow drivers to
assign a cookie to control port TX frames (returned via extack in
the netlink ACK message of the command) and then later report the
frame's status.
Signed-off-by: Markus Theil <markus.theil@tu-ilmenau.de>
Link: https://lore.kernel.org/r/20200508144202.7678-2-markus.theil@tu-ilmenau.de
[use extack cookie instead of explicit message, recombine patches]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If the driver advertises NL80211_EXT_FEATURE_SCAN_FREQ_KHZ
userspace can omit NL80211_ATTR_SCAN_FREQUENCIES in favor
of an NL80211_ATTR_SCAN_FREQ_KHZ. To get scan results in
KHz userspace must also set the
NL80211_SCAN_FLAG_FREQ_KHZ.
This lets nl80211 remain compatible with older userspaces
while not requring and sending redundant (and potentially
incorrect) scan frequency sets.
Signed-off-by: Thomas Pedersen <thomas@adapt-ip.com>
Link: https://lore.kernel.org/r/20200430172554.18383-4-thomas@adapt-ip.com
[use just nla_nest_start() (not _noflag) for NL80211_ATTR_SCAN_FREQ_KHZ]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
cfg80211 recently gained the ability to understand a
frequency offset component in KHz. Expose this in nl80211
through the new attributes NL80211_ATTR_WIPHY_FREQ_OFFSET,
NL80211_FREQUENCY_ATTR_OFFSET,
NL80211_ATTR_CENTER_FREQ1_OFFSET, and
NL80211_BSS_FREQUENCY_OFFSET.
These add support to send and receive a KHz offset
component with the following NL80211 commands:
- NL80211_CMD_FRAME
- NL80211_CMD_GET_SCAN
- NL80211_CMD_AUTHENTICATE
- NL80211_CMD_ASSOCIATE
- NL80211_CMD_CONNECT
Along with any other command which takes a chandef, ie:
- NL80211_CMD_SET_CHANNEL
- NL80211_CMD_SET_WIPHY
- NL80211_CMD_START_AP
- NL80211_CMD_RADAR_DETECT
- NL80211_CMD_NOTIFY_RADAR
- NL80211_CMD_CHANNEL_SWITCH
- NL80211_JOIN_IBSS
- NL80211_CMD_REMAIN_ON_CHANNEL
- NL80211_CMD_JOIN_OCB
- NL80211_CMD_JOIN_MESH
- NL80211_CMD_TDLS_CHANNEL_SWITCH
If the driver advertises a band containing channels with
frequency offset, it must also verify support for
frequency offset channels in its cfg80211 ops, or return
an error.
Signed-off-by: Thomas Pedersen <thomas@adapt-ip.com>
Link: https://lore.kernel.org/r/20200430172554.18383-3-thomas@adapt-ip.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Current rule for applying TID configuration for specific peer looks overly
complicated. No need to reject new TID configuration when override flag is
specified. Another call with the same TID configuration, but without
override flag, allows to apply new configuration anyway.
Use the same approach as for the 'all peers' case: if override flag is
specified, then reset existing TID configuration and immediately
apply a new one.
Signed-off-by: Sergey Matyukevich <sergey.matyukevich.os@quantenna.com>
Link: https://lore.kernel.org/r/20200424112905.26770-5-sergey.matyukevich.os@quantenna.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Allow the user to configure where on the cable the TDR data should be
retrieved, in terms of first and last sample, and the step between
samples. Also add the ability to ask for TDR data for just one pair.
If this configuration is not provided, it defaults to 1-150m at 1m
intervals for all pairs.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
v3:
Move the TDR configuration into a structure
Add a range check on step
Use NL_SET_ERR_MSG_ATTR() when appropriate
Move TDR configuration into a nest
Document attributes in the request
Signed-off-by: David S. Miller <davem@davemloft.net>
Some Ethernet PHYs can return the raw time domain reflectromatry data.
Add the attributes to allow this data to be requested and returned via
netlink ethtool.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
v2:
m -> cm
Report what the PHY actually used for start/stop/step.
Signed-off-by: David S. Miller <davem@davemloft.net>
* hwsim improvements from Jouni and myself, to be able to
test more scenarios easily
* some more HE (802.11ax) support
* some initial S1G (sub 1 GHz) work for fractional MHz channels
* some (action) frame registration updates to help DPP support
* along with other various improvements/fixes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl7L1A8ACgkQB8qZga/f
l8T5RQ/9EUxFqBs0cWojwyad5nkesyl51eOnbvSCJJF14W93s2oMeikCynTPe8Vg
km36041QZqGbwmU0yWC9Lmm4y3ja5qQGI+QW+vT6tutGQx6FgK5TzUfYXqiFZqf6
asqkvHpH4VqmbG1KEp0PZjIpW/OVK96pbvtXVnkrcMmjl2JjbRtAhyZQVNtt9ufJ
6wqKf8e6iYqMIInMFPLX+rl7UEknxDKVcqPbMMJmY8/iM1z9Elkg3rkRSMehC+mE
8cznZ6BsjAGCbMiA8K9fUo15lcMfZCJ1hAPzkD4TsJtMEJ0gYDo5jDB8TIpr5uoL
95OnlF8jokJIsO+1g4CyaNSQsmFIuDo84vW8LtGRu9qzTP0UwelxhjZLgE3xlP6b
W+z5HomxfWkYhJhaNywLP3B1VPtJwX8dL/wpECOWHzNKXG7Rb6GqzUwaCRFb6Jjo
TmFJ5wLoEZHhsXYO2dvcyTzCUCXviXvfq60a56IyCJN8wDqmcubePv0+NOHUmj3c
E71NTYymM3j9agdSpXdCFLBXA1OgyIydeSNHuBlaPA4sK6tr4ikUtbOrABjYTaQz
2BB5fHEi8gs4EiHbSXqLFBot3JHljKJPsSN0wAgzQffN+6Kts9FG6HcrLsL+duDg
lRdAzRrunE85S0QhsxeVIX216rX4W08sl0B1rJR+dTMX9ByblAk=
=MVBJ
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-net-next-2020-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
One batch of changes, containing:
* hwsim improvements from Jouni and myself, to be able to
test more scenarios easily
* some more HE (802.11ax) support
* some initial S1G (sub 1 GHz) work for fractional MHz channels
* some (action) frame registration updates to help DPP support
* along with other various improvements/fixes
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
With struct flow_dissector_key_mpls now recording the first
FLOW_DIS_MPLS_MAX labels, we can extend Flower to filter on any of
these LSEs independently.
In order to avoid creating new netlink attributes for every possible
depth, let's define a new TCA_FLOWER_KEY_MPLS_OPTS nested attribute
that contains the list of LSEs to match. Each LSE is represented by
another attribute, TCA_FLOWER_KEY_MPLS_OPTS_LSE, which then contains
the attributes representing the depth and the MPLS fields to match at
this depth (label, TTL, etc.).
For each MPLS field, the mask is always set to all-ones, as this is
what the original API did. We could allow user configurable masks in
the future if there is demand for more flexibility.
The new API also allows to only specify an LSE depth. In that case,
Flower only verifies that the MPLS label stack depth is greater or
equal to the provided depth (that is, an LSE exists at this depth).
Filters that only match on one (or more) fields of the first LSE are
dumped using the old netlink attributes, to avoid confusing user space
programs that don't understand the new API.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The MSCC bug fix in 'net' had to be slightly adjusted because the
register accesses are done slightly differently in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2020-05-23
The following pull-request contains BPF updates for your *net-next* tree.
We've added 50 non-merge commits during the last 8 day(s) which contain
a total of 109 files changed, 2776 insertions(+), 2887 deletions(-).
The main changes are:
1) Add a new AF_XDP buffer allocation API to the core in order to help
lowering the bar for drivers adopting AF_XDP support. i40e, ice, ixgbe
as well as mlx5 have been moved over to the new API and also gained a
small improvement in performance, from Björn Töpel and Magnus Karlsson.
2) Add getpeername()/getsockname() attach types for BPF sock_addr programs
in order to allow for e.g. reverse translation of load-balancer backend
to service address/port tuple from a connected peer, from Daniel Borkmann.
3) Improve the BPF verifier is_branch_taken() logic to evaluate pointers
being non-NULL, e.g. if after an initial test another non-NULL test on
that pointer follows in a given path, then it can be pruned right away,
from John Fastabend.
4) Larger rework of BPF sockmap selftests to make output easier to understand
and to reduce overall runtime as well as adding new BPF kTLS selftests
that run in combination with sockmap, also from John Fastabend.
5) Batch of misc updates to BPF selftests including fixing up test_align
to match verifier output again and moving it under test_progs, allowing
bpf_iter selftest to compile on machines with older vmlinux.h, and
updating config options for lirc and v6 segment routing helpers, from
Stanislav Fomichev, Andrii Nakryiko and Alan Maguire.
6) Conversion of BPF tracing samples outdated internal BPF loader to use
libbpf API instead, from Daniel T. Lee.
7) Follow-up to BPF kernel test infrastructure in order to fix a flake in
the XDP selftests, from Jesper Dangaard Brouer.
8) Minor improvements to libbpf's internal hashmap implementation, from
Ian Rogers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Todays vxlan mac fdb entries can point to multiple remote
ips (rdsts) with the sole purpose of replicating
broadcast-multicast and unknown unicast packets to those remote ips.
E-VPN multihoming [1,2,3] requires bridged vxlan traffic to be
load balanced to remote switches (vteps) belonging to the
same multi-homed ethernet segment (E-VPN multihoming is analogous
to multi-homed LAG implementations, but with the inter-switch
peerlink replaced with a vxlan tunnel). In other words it needs
support for mac ecmp. Furthermore, for faster convergence, E-VPN
multihoming needs the ability to update fdb ecmp nexthops independent
of the fdb entries.
New route nexthop API is perfect for this usecase.
This patch extends the vxlan fdb code to take a nexthop id
pointing to an ecmp nexthop group.
Changes include:
- New NDA_NH_ID attribute for fdbs
- Use the newly added fdb nexthop groups
- makes vxlan rdsts and nexthop handling code mutually
exclusive
- since this is a new use-case and the requirement is for ecmp
nexthop groups, the fdb add and update path checks that the
nexthop is really an ecmp nexthop group. This check can be relaxed
in the future, if we want to introduce replication fdb nexthop groups
and allow its use in lieu of current rdst lists.
- fdb update requests with nexthop id's only allowed for existing
fdb's that have nexthop id's
- learning will not override an existing fdb entry with nexthop
group
- I have wrapped the switchdev offload code around the presence of
rdst
[1] E-VPN RFC https://tools.ietf.org/html/rfc7432
[2] E-VPN with vxlan https://tools.ietf.org/html/rfc8365
[3] http://vger.kernel.org/lpc_net2018_talks/scaling_bridge_fdb_database_slidesV3.pdf
Includes a null check fix in vxlan_xmit from Nikolay
v2 - Fixed build issue:
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces ecmp nexthops and nexthop groups
for mac fdb entries. In subsequent patches this is used
by the vxlan driver fdb entries. The use case is
E-VPN multihoming [1,2,3] which requires bridged vxlan traffic
to be load balanced to remote switches (vteps) belonging to
the same multi-homed ethernet segment (This is analogous to
a multi-homed LAG but over vxlan).
Changes include new nexthop flag NHA_FDB for nexthops
referenced by fdb entries. These nexthops only have ip.
This patch includes appropriate checks to avoid routes
referencing such nexthops.
example:
$ip nexthop add id 12 via 172.16.1.2 fdb
$ip nexthop add id 13 via 172.16.1.3 fdb
$ip nexthop add id 102 group 12/13 fdb
$bridge fdb add 02:02:00:00:00:13 dev vxlan1000 nhid 101 self
[1] E-VPN https://tools.ietf.org/html/rfc7432
[2] E-VPN VxLAN: https://tools.ietf.org/html/rfc8365
[3] LPC talk with mention of nexthop groups for L2 ecmp
http://vger.kernel.org/lpc_net2018_talks/scaling_bridge_fdb_database_slidesV3.pdf
v4 - fixed uninitialized variable reported by kernel test robot
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, psample can only send the packet bits after decapsulation.
The tunnel information is lost. Add the tunnel support.
If the sampled packet has no tunnel info, the behavior is the same as
before. If it has, add a nested metadata field named PSAMPLE_ATTR_TUNNEL
and include the tunnel subfields if applicable.
Increase the metadata length for sampled packet with the tunnel info.
If new subfields of tunnel info should be included, update the metadata
length accordingly.
Signed-off-by: Chris Mi <chrism@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As stated in 983695fa67 ("bpf: fix unconnected udp hooks"), the objective
for the existing cgroup connect/sendmsg/recvmsg/bind BPF hooks is to be
transparent to applications. In Cilium we make use of these hooks [0] in
order to enable E-W load balancing for existing Kubernetes service types
for all Cilium managed nodes in the cluster. Those backends can be local
or remote. The main advantage of this approach is that it operates as close
as possible to the socket, and therefore allows to avoid packet-based NAT
given in connect/sendmsg/recvmsg hooks we only need to xlate sock addresses.
This also allows to expose NodePort services on loopback addresses in the
host namespace, for example. As another advantage, this also efficiently
blocks bind requests for applications in the host namespace for exposed
ports. However, one missing item is that we also need to perform reverse
xlation for inet{,6}_getname() hooks such that we can return the service
IP/port tuple back to the application instead of the remote peer address.
The vast majority of applications does not bother about getpeername(), but
in a few occasions we've seen breakage when validating the peer's address
since it returns unexpectedly the backend tuple instead of the service one.
Therefore, this trivial patch allows to customise and adds a getpeername()
as well as getsockname() BPF cgroup hook for both IPv4 and IPv6 in order
to address this situation.
Simple example:
# ./cilium/cilium service list
ID Frontend Service Type Backend
1 1.2.3.4:80 ClusterIP 1 => 10.0.0.10:80
Before; curl's verbose output example, no getpeername() reverse xlation:
# curl --verbose 1.2.3.4
* Rebuilt URL to: 1.2.3.4/
* Trying 1.2.3.4...
* TCP_NODELAY set
* Connected to 1.2.3.4 (10.0.0.10) port 80 (#0)
> GET / HTTP/1.1
> Host: 1.2.3.4
> User-Agent: curl/7.58.0
> Accept: */*
[...]
After; with getpeername() reverse xlation:
# curl --verbose 1.2.3.4
* Rebuilt URL to: 1.2.3.4/
* Trying 1.2.3.4...
* TCP_NODELAY set
* Connected to 1.2.3.4 (1.2.3.4) port 80 (#0)
> GET / HTTP/1.1
> Host: 1.2.3.4
> User-Agent: curl/7.58.0
> Accept: */*
[...]
Originally, I had both under a BPF_CGROUP_INET{4,6}_GETNAME type and exposed
peer to the context similar as in inet{,6}_getname() fashion, but API-wise
this is suboptimal as it always enforces programs having to test for ctx->peer
which can easily be missed, hence BPF_CGROUP_INET{4,6}_GET{PEER,SOCK}NAME split.
Similarly, the checked return code is on tnum_range(1, 1), but if a use case
comes up in future, it can easily be changed to return an error code instead.
Helper and ctx member access is the same as with connect/sendmsg/etc hooks.
[0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Andrey Ignatov <rdna@fb.com>
Link: https://lore.kernel.org/bpf/61a479d759b2482ae3efb45546490bacd796a220.1589841594.git.daniel@iogearbox.net
Alexei Starovoitov says:
====================
pull-request: bpf-next 2020-05-15
The following pull-request contains BPF updates for your *net-next* tree.
We've added 37 non-merge commits during the last 1 day(s) which contain
a total of 67 files changed, 741 insertions(+), 252 deletions(-).
The main changes are:
1) bpf_xdp_adjust_tail() now allows to grow the tail as well, from Jesper.
2) bpftool can probe CONFIG_HZ, from Daniel.
3) CAP_BPF is introduced to isolate user processes that use BPF infra and
to secure BPF networking services by dropping CAP_SYS_ADMIN requirement
in certain cases, from Alexei.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add new TCA_DUMP_FLAGS attribute and use it in cls API to request terse
filter output from classifiers with TCA_DUMP_FLAGS_TERSE flag. This option
is intended to be used to improve performance of TC filter dump when
userland only needs to obtain stats and not the whole classifier/action
data. Extend struct tcf_proto_ops with new terse_dump() callback that must
be defined by supporting classifier implementations.
Support of the options in specific classifiers and actions is
implemented in following patches in the series.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Split BPF operations that are allowed under CAP_SYS_ADMIN into
combination of CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN.
For backward compatibility include them in CAP_SYS_ADMIN as well.
The end result provides simple safety model for applications that use BPF:
- to load tracing program types
BPF_PROG_TYPE_{KPROBE, TRACEPOINT, PERF_EVENT, RAW_TRACEPOINT, etc}
use CAP_BPF and CAP_PERFMON
- to load networking program types
BPF_PROG_TYPE_{SCHED_CLS, XDP, SK_SKB, etc}
use CAP_BPF and CAP_NET_ADMIN
There are few exceptions from this rule:
- bpf_trace_printk() is allowed in networking programs, but it's using
tracing mechanism, hence this helper needs additional CAP_PERFMON
if networking program is using this helper.
- BPF_F_ZERO_SEED flag for hash/lru map is allowed under CAP_SYS_ADMIN only
to discourage production use.
- BPF HW offload is allowed under CAP_SYS_ADMIN.
- bpf_probe_write_user() is allowed under CAP_SYS_ADMIN only.
CAPs are not checked at attach/detach time with two exceptions:
- loading BPF_PROG_TYPE_CGROUP_SKB is allowed for unprivileged users,
hence CAP_NET_ADMIN is required at attach time.
- flow_dissector detach doesn't check prog FD at detach,
hence CAP_NET_ADMIN is required at detach time.
CAP_SYS_ADMIN is required to iterate BPF objects (progs, maps, links) via get_next_id
command and convert them to file descriptor via GET_FD_BY_ID command.
This restriction guarantees that mutliple tasks with CAP_BPF are not able to
affect each other. That leads to clean isolation of tasks. For example:
task A with CAP_BPF and CAP_NET_ADMIN loads and attaches a firewall via bpf_link.
task B with the same capabilities cannot detach that firewall unless
task A explicitly passed link FD to task B via scm_rights or bpffs.
CAP_SYS_ADMIN can still detach/unload everything.
Two networking user apps with CAP_SYS_ADMIN and CAP_NET_ADMIN can
accidentely mess with each other programs and maps.
Two networking user apps with CAP_NET_ADMIN and CAP_BPF cannot affect each other.
CAP_NET_ADMIN + CAP_BPF allows networking programs access only packet data.
Such networking progs cannot access arbitrary kernel memory or leak pointers.
bpftool, bpftrace, bcc tools binaries should NOT be installed with
CAP_BPF and CAP_PERFMON, since unpriv users will be able to read kernel secrets.
But users with these two permissions will be able to use these tracing tools.
CAP_PERFMON is least secure, since it allows kprobes and kernel memory access.
CAP_NET_ADMIN can stop network traffic via iproute2.
CAP_BPF is the safest from security point of view and harmless on its own.
Having CAP_BPF and/or CAP_NET_ADMIN is not enough to write into arbitrary map
and if that map is used by firewall-like bpf prog.
CAP_BPF allows many bpf prog_load commands in parallel. The verifier
may consume large amount of memory and significantly slow down the system.
Existing unprivileged BPF operations are not affected.
In particular unprivileged users are allowed to load socket_filter and cg_skb
program types and to create array, hash, prog_array, map-in-map map types.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200513230355.7858-2-alexei.starovoitov@gmail.com
Finally, after all drivers have a frame size, allow BPF-helper
bpf_xdp_adjust_tail() to grow or extend packet size at frame tail.
Remember that helper/macro xdp_data_hard_end have reserved some
tailroom. Thus, this helper makes sure that the BPF-prog don't have
access to this tailroom area.
V2: Remove one chicken check and use WARN_ONCE for other
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/158945348530.97035.12577148209134239291.stgit@firesoul
Alexei Starovoitov says:
====================
pull-request: bpf-next 2020-05-14
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Merged tag 'perf-for-bpf-2020-05-06' from tip tree that includes CAP_PERFMON.
2) support for narrow loads in bpf_sock_addr progs and additional
helpers in cg-skb progs, from Andrey.
3) bpf benchmark runner, from Andrii.
4) arm and riscv JIT optimizations, from Luke.
5) bpf iterator infrastructure, from Yonghong.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
With having ability to lookup sockets in cgroup skb programs it becomes
useful to access cgroup id of retrieved sockets so that policies can be
implemented based on origin cgroup of such socket.
For example, a container running in a cgroup can have cgroup skb ingress
program that can lookup peer socket that is sending packets to a process
inside the container and decide whether those packets should be allowed
or denied based on cgroup id of the peer.
More specifically such ingress program can implement intra-host policy
"allow incoming packets only from this same container and not from any
other container on same host" w/o relying on source IP addresses since
quite often it can be the case that containers share same IP address on
the host.
Introduce two new helpers for this use-case: bpf_sk_cgroup_id() and
bpf_sk_ancestor_cgroup_id().
These helpers are similar to existing bpf_skb_{,ancestor_}cgroup_id
helpers with the only difference that sk is used to get cgroup id
instead of skb, and share code with them.
See documentation in UAPI for more details.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/f5884981249ce911f63e9b57ecd5d7d19154ff39.1589486450.git.rdna@fb.com
bpf_sock_addr.user_port supports only 4-byte load and it leads to ugly
code in BPF programs, like:
volatile __u32 user_port = ctx->user_port;
__u16 port = bpf_ntohs(user_port);
Since otherwise clang may optimize the load to be 2-byte and it's
rejected by verifier.
Add support for 1- and 2-byte loads same way as it's supported for other
fields in bpf_sock_addr like user_ip4, msg_src_ip4, etc.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/c1e983f4c17573032601d0b2b1f9d1274f24bc16.1589420814.git.rdna@fb.com
Raw Gadget is currently unable to stall/halt/wedge gadget endpoints,
which is required for proper emulation of certain USB classes.
This patch adds a few more ioctls:
- USB_RAW_IOCTL_EP0_STALL allows to stall control endpoint #0 when
there's a pending setup request for it.
- USB_RAW_IOCTL_SET/CLEAR_HALT/WEDGE allow to set/clear halt/wedge status
on non-control non-isochronous endpoints.
Fixes: f2c2e71764 ("usb: gadget: add raw-gadget interface")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Felipe Balbi <balbi@kernel.org>
Currently automatic gadget endpoint selection based on required features
doesn't work. Raw Gadget tries iterating over the list of available
endpoints and finding one that has the right direction and transfer type.
Unfortunately selecting arbitrary gadget endpoints (even if they satisfy
feature requirements) doesn't work, as (depending on the UDC driver) they
might have fixed addresses, and one also needs to provide matching
endpoint addresses in the descriptors sent to the host.
The composite framework deals with this by assigning endpoint addresses
in usb_ep_autoconfig() before enumeration starts. This approach won't work
with Raw Gadget as the endpoints are supposed to be enabled after a
set_configuration/set_interface request from the host, so it's too late to
patch the endpoint descriptors that had already been sent to the host.
For Raw Gadget we take another approach. Similarly to GadgetFS, we allow
the user to make the decision as to which gadget endpoints to use.
This patch adds another Raw Gadget ioctl USB_RAW_IOCTL_EPS_INFO that
exposes information about all non-control endpoints that a currently
connected UDC has. This information includes endpoints addresses, as well
as their capabilities and limits to allow the user to choose the most
fitting gadget endpoint.
The USB_RAW_IOCTL_EP_ENABLE ioctl is updated to use the proper endpoint
validation routine usb_gadget_ep_match_desc().
These changes affect the portability of the gadgets that use Raw Gadget
when running on different UDCs. Nevertheless, as long as the user relies
on the information provided by USB_RAW_IOCTL_EPS_INFO to dynamically
choose endpoint addresses, UDC-agnostic gadgets can still be written with
Raw Gadget.
Fixes: f2c2e71764 ("usb: gadget: add raw-gadget interface")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Felipe Balbi <balbi@kernel.org>
Minor improvements to the documentation for BPF helpers:
* Fix formatting for the description of "bpf_socket" for
bpf_getsockopt() and bpf_setsockopt(), thus suppressing two warnings
from rst2man about "Unexpected indentation".
* Fix formatting for return values for bpf_sk_assign() and seq_file
helpers.
* Fix and harmonise formatting, in particular for function/struct names.
* Remove blank lines before "Return:" sections.
* Replace tabs found in the middle of text lines.
* Fix typos.
* Add a note to the footer (in Python script) about "bpftool feature
probe", including for listing features available to unprivileged
users, and add a reference to bpftool man page.
Thanks to Florian for reporting two typos (duplicated words).
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200511161536.29853-4-quentin@isovalent.com
Add the attributes needed to report cable test results to userspace.
The reports are expected to be per twisted pair. A nested property per
pair can report the result of the cable test. A nested property can
also report the length of the cable to any fault.
v2:
Grammar fixes
Change length from u16 to u32
s/DEV/HEADER/g
Add status attributes
Rename pairs from numbers to letters.
v3:
Fixed example in document
Add ETHTOOL_A_CABLE_NEST_* enum
Add ETHTOOL_MSG_CABLE_TEST_NTF to documentation
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add new ethtool netlink calls to trigger the starting of a PHY cable
test.
Add Kconfig'ury to ETHTOOL_NETLINK so that PHYLIB is not a module when
ETHTOOL_NETLINK is builtin, which would result in kernel linking errors.
v2:
Remove unwanted white space change
Remove ethnl_cable_test_act_ops and use doit handler
Rename cable_test_set_policy cable_test_act_policy
Remove ETHTOOL_MSG_CABLE_TEST_ACT_REPLY
v3:
Remove ETHTOOL_MSG_CABLE_TEST_ACT_REPLY from documentation
Remove unused cable_test_get_policy
Add Reviewed-by tags
v4:
Remove unwanted blank line
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Two helpers bpf_seq_printf and bpf_seq_write, are added for
writing data to the seq_file buffer.
bpf_seq_printf supports common format string flag/width/type
fields so at least I can get identical results for
netlink and ipv6_route targets.
For bpf_seq_printf and bpf_seq_write, return value -EOVERFLOW
specifically indicates a write failure due to overflow, which
means the object will be repeated in the next bpf invocation
if object collection stays the same. Note that if the object
collection is changed, depending how collection traversal is
done, even if the object still in the collection, it may not
be visited.
For bpf_seq_printf, format %s, %p{i,I}{4,6} needs to
read kernel memory. Reading kernel memory may fail in
the following two cases:
- invalid kernel address, or
- valid kernel address but requiring a major fault
If reading kernel memory failed, the %s string will be
an empty string and %p{i,I}{4,6} will be all 0.
Not returning error to bpf program is consistent with
what bpf_trace_printk() does for now.
bpf_seq_printf may return -EBUSY meaning that internal percpu
buffer for memory copy of strings or other pointees is
not available. Bpf program can return 1 to indicate it
wants the same object to be repeated. Right now, this should not
happen on no-RT kernels since migrate_disable(), which guards
bpf prog call, calls preempt_disable().
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175914.2476661-1-yhs@fb.com
A new bpf command BPF_ITER_CREATE is added.
The anonymous bpf iterator is seq_file based.
The seq_file private data are referenced by targets.
The bpf_iter infrastructure allocated additional space
at seq_file->private before the space used by targets
to store some meta data, e.g.,
prog: prog to run
session_id: an unique id for each opened seq_file
seq_num: how many times bpf programs are queried in this session
done_stop: an internal state to decide whether bpf program
should be called in seq_ops->stop() or not
The seq_num will start from 0 for valid objects.
The bpf program may see the same seq_num more than once if
- seq_file buffer overflow happens and the same object
is retried by bpf_seq_read(), or
- the bpf program explicitly requests a retry of the
same object
Since module is not supported for bpf_iter, all target
registeration happens at __init time, so there is no
need to change bpf_iter_unreg_target() as it is used
mostly in error path of the init function at which time
no bpf iterators have been created yet.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175905.2475770-1-yhs@fb.com
Given a bpf program, the step to create an anonymous bpf iterator is:
- create a bpf_iter_link, which combines bpf program and the target.
In the future, there could be more information recorded in the link.
A link_fd will be returned to the user space.
- create an anonymous bpf iterator with the given link_fd.
The bpf_iter_link can be pinned to bpffs mount file system to
create a file based bpf iterator as well.
The benefit to use of bpf_iter_link:
- using bpf link simplifies design and implementation as bpf link
is used for other tracing bpf programs.
- for file based bpf iterator, bpf_iter_link provides a standard
way to replace underlying bpf programs.
- for both anonymous and free based iterators, bpf link query
capability can be leveraged.
The patch added support of tracing/iter programs for BPF_LINK_CREATE.
A new link type BPF_LINK_TYPE_ITER is added to facilitate link
querying. Currently, only prog_id is needed, so there is no
additional in-kernel show_fdinfo() and fill_link_info() hook
is needed for BPF_LINK_TYPE_ITER link.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175901.2475084-1-yhs@fb.com
A bpf_iter program is a tracing program with attach type
BPF_TRACE_ITER. The load attribute
attach_btf_id
is used by the verifier against a particular kernel function,
which represents a target, e.g., __bpf_iter__bpf_map
for target bpf_map which is implemented later.
The program return value must be 0 or 1 for now.
0 : successful, except potential seq_file buffer overflow
which is handled by seq_file reader.
1 : request to restart the same object
In the future, other return values may be used for filtering or
teminating the iterator.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175900.2474947-1-yhs@fb.com
We want to have a tighter control on what ports we bind to in
the BPF_CGROUP_INET{4,6}_CONNECT hooks even if it means
connect() becomes slightly more expensive. The expensive part
comes from the fact that we now need to call inet_csk_get_port()
that verifies that the port is not used and allocates an entry
in the hash table for it.
Since we can't rely on "snum || !bind_address_no_port" to prevent
us from calling POST_BIND hook anymore, let's add another bind flag
to indicate that the call site is BPF program.
v5:
* fix wrong AF_INET (should be AF_INET6) in the bpf program for v6
v3:
* More bpf_bind documentation refinements (Martin KaFai Lau)
* Add UDP tests as well (Martin KaFai Lau)
* Don't start the thread, just do socket+bind+listen (Martin KaFai Lau)
v2:
* Update documentation (Andrey Ignatov)
* Pass BIND_FORCE_ADDRESS_NO_PORT conditionally (Andrey Ignatov)
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200508174611.228805-5-sdf@google.com