Alexei Starovoitov says:
====================
pull-request: bpf-next 2018-01-26
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) A number of extensions to tcp-bpf, from Lawrence.
- direct R or R/W access to many tcp_sock fields via bpf_sock_ops
- passing up to 3 arguments to bpf_sock_ops functions
- tcp_sock field bpf_sock_ops_cb_flags for controlling callbacks
- optionally calling bpf_sock_ops program when RTO fires
- optionally calling bpf_sock_ops program when packet is retransmitted
- optionally calling bpf_sock_ops program when TCP state changes
- access to tclass and sk_txhash
- new selftest
2) div/mod exception handling, from Daniel.
One of the ugly leftovers from the early eBPF days is that div/mod
operations based on registers have a hard-coded src_reg == 0 test
in the interpreter as well as in JIT code generators that would
return from the BPF program with exit code 0. This was basically
adopted from cBPF interpreter for historical reasons.
There are multiple reasons why this is very suboptimal and prone
to bugs. To name one: the return code mapping for such abnormal
program exit of 0 does not always match with a suitable program
type's exit code mapping. For example, '0' in tc means action 'ok'
where the packet gets passed further up the stack, which is just
undesirable for such cases (e.g. when implementing policy) and
also does not match with other program types.
After considering _four_ different ways to address the problem,
we adapt the same behavior as on some major archs like ARMv8:
X div 0 results in 0, and X mod 0 results in X. aarch64 and
aarch32 ISA do not generate any traps or otherwise aborts
of program execution for unsigned divides.
Given the options, it seems the most suitable from
all of them, also since major archs have similar schemes in
place. Given this is all in the realm of undefined behavior,
we still have the option to adapt if deemed necessary.
3) sockmap sample refactoring, from John.
4) lpm map get_next_key fixes, from Yonghong.
5) test cleanups, from Alexei and Prashant.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch adds support for openvswitch to configure erspan
v1 and v2. The OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS attr is added
to uapi as a binary blob to support all ERSPAN v1 and v2's
fields. Note that Previous commit "openvswitch: Add erspan tunnel
support." was reverted since it does not design properly.
Signed-off-by: William Tu <u9012063@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch adds a new uapi header file, erspan.h, and moves
the 'struct erspan_metadata' from internal erspan.h to it.
Signed-off-by: William Tu <u9012063@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds support for calling sock_ops BPF program when there is a TCP state
change. Two arguments are used; one for the old state and another for
the new state.
There is a new enum in include/uapi/linux/bpf.h that exports the TCP
states that prepends BPF_ to the current TCP state names. If it is ever
necessary to change the internal TCP state values (other than adding
more to the end), then it will become necessary to convert from the
internal TCP state value to the BPF value before calling the BPF
sock_ops function. There are a set of compile checks added in tcp.c
to detect if the internal and BPF values differ so we can make the
necessary fixes.
New op: BPF_SOCK_OPS_STATE_CB.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Adds support for calling sock_ops BPF program when there is a
retransmission. Three arguments are used; one for the sequence number,
another for the number of segments retransmitted, and the last one for
the return value of tcp_transmit_skb (0 => success).
Does not include syn-ack retransmissions.
New op: BPF_SOCK_OPS_RETRANS_CB.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Adds an optional call to sock_ops BPF program based on whether the
BPF_SOCK_OPS_RTO_CB_FLAG is set in bpf_sock_ops_flags.
The BPF program is passed 2 arguments: icsk_retransmits and whether the
RTO has expired.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Adds field bpf_sock_ops_cb_flags to tcp_sock and bpf_sock_ops. Its primary
use is to determine if there should be calls to sock_ops bpf program at
various points in the TCP code. The field is initialized to zero,
disabling the calls. A sock_ops BPF program can set it, per connection and
as necessary, when the connection is established.
It also adds support for reading and writting the field within a
sock_ops BPF program. Reading is done by accessing the field directly.
However, writing is done through the helper function
bpf_sock_ops_cb_flags_set, in order to return an error if a BPF program
is trying to set a callback that is not supported in the current kernel
(i.e. running an older kernel). The helper function returns 0 if it was
able to set all of the bits set in the argument, a positive number
containing the bits that could not be set, or -EINVAL if the socket is
not a full TCP socket.
Examples of where one could call the bpf program:
1) When RTO fires
2) When a packet is retransmitted
3) When the connection terminates
4) When a packet is sent
5) When a packet is received
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Adds support for passing up to 4 arguments to sock_ops bpf functions. It
reusues the reply union, so the bpf_sock_ops structures are not
increased in size.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
en_rx_am.c was deleted in 'net-next' but had a bug fixed in it in
'net'.
The esp{4,6}_offload.c conflicts were overlapping changes.
The 'out' label is removed so we just return ERR_PTR(-EINVAL)
directly.
Signed-off-by: David S. Miller <davem@davemloft.net>
Expose the number of times the link has been going UP or DOWN, and
update the "carrier_changes" counter to be the sum of these two events.
While at it, also update the sysfs-class-net documentation to cover:
carrier_changes (3.15), carrier_up_count (4.16) and carrier_down_count
(4.16)
Signed-off-by: David Decotigny <decot@googlers.com>
[Florian:
* rebase
* add documentation
* merge carrier_changes with up/down counters]
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit ccfdec9089 ("macsec: Add support for GCM-AES-256 cipher suite")
changed a few values in the uapi headers for MACsec.
Because of existing userspace implementations, we need to preserve the
value of MACSEC_DEFAULT_CIPHER_ID. Not doing that resulted in
wpa_supplicant segfaults when a secure channel was created using the
default cipher. Thus, swap MACSEC_DEFAULT_CIPHER_{ID,ALT} back to their
original values.
Changing the maximum length of the MACSEC_SA_ATTR_KEY attribute is
unnecessary, as the previous value (MACSEC_MAX_KEY_LEN, which was 128B)
is large enough to carry 32-bytes keys. This patch reverts
MACSEC_MAX_KEY_LEN to 128B and restores the old length check on
MACSEC_SA_ATTR_KEY.
Fixes: ccfdec9089 ("macsec: Add support for GCM-AES-256 cipher suite")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for your net-next
tree. Basically, a new extension for ip6tables, simplification work of
nf_tables that saves us 500 LoC, allow raw table registration before
defragmentation, conversion of the SNMP helper to use the ASN.1 code
generator, unique 64-bit handle for all nf_tables objects and fixes to
address fallout from previous nf-next batch. More specifically, they
are:
1) Seven patches to remove family abstraction layer (struct nft_af_info)
in nf_tables, this simplifies our codebase and it saves us 64 bytes per
net namespace.
2) Add IPv6 segment routing header matching for ip6tables, from Ahmed
Abdelsalam.
3) Allow to register iptable_raw table before defragmentation, some
people do not want to waste cycles on defragmenting traffic that is
going to be dropped, hence add a new module parameter to enable this
behaviour in iptables and ip6tables. From Subash Abhinov
Kasiviswanathan. This patch needed a couple of follow up patches to
get things tidy from Arnd Bergmann.
4) SNMP helper uses the ASN.1 code generator, from Taehee Yoo. Several
patches for this helper to prepare this change are also part of this
patch series.
5) Add 64-bit handles to uniquely objects in nf_tables, from Harsha
Sharma.
6) Remove log message that several netfilter subsystems print at
boot/load time.
7) Restore x_tables module autoloading, that got broken in a previous
patch to allow singleton NAT hook callback registration per hook
spot, from Florian Westphal. Moreover, return EBUSY to report that
the singleton NAT hook slot is already in instead.
8) Several fixes for the new nf_tables flowtable representation,
including incorrect error check after nf_tables_flowtable_lookup(),
missing Kconfig dependencies that lead to build breakage and missing
initialization of priority and hooknum in flowtable object.
9) Missing NETFILTER_FAMILY_ARP dependency in Kconfig for the clusterip
target. This is due to recent updates in the core to shrink the hook
array size and compile it out if no specific family is enabled via
.config file. Patch from Florian Westphal.
10) Remove duplicated include header files, from Wei Yongjun.
11) Sparse warning fix for the NFPROTO_INET handling from the core
due to missing static function definition, also from Wei Yongjun.
12) Restore ICMPv6 Parameter Problem error reporting when
defragmentation fails, from Subash Abhinov Kasiviswanathan.
13) Remove obsolete owner field initialization from struct
file_operations, patch from Alexey Dobriyan.
14) Use boolean datatype where needed in the Netfilter codebase, from
Gustavo A. R. Silva.
15) Remove double semicolon in dynset nf_tables expression, from
Luis de Bethencourt.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2018-01-19
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) bpf array map HW offload, from Jakub.
2) support for bpf_get_next_key() for LPM map, from Yonghong.
3) test_verifier now runs loaded programs, from Alexei.
4) xdp cpumap monitoring, from Jesper.
5) variety of tests, cleanups and small x64 JIT optimization, from Daniel.
6) user space can now retrieve HW JITed program, from Jiong.
Note there is a minor conflict between Russell's arm32 JIT fixes
and removal of bpf_jit_enable variable by Daniel which should
be resolved by keeping Russell's comment and removing that variable.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
ARM:
* fix incorrect huge page mappings on systems using the contiguous hint
for hugetlbfs
* support alternative GICv4 init sequence
* correctly implement the ARM SMCC for HVC and SMC handling
PPC:
* add KVM IOCTL for reporting vulnerability and workaround status
s390:
* provide userspace interface for branch prediction changes in firmware
x86:
* use correct macros for bits
-----BEGIN PGP SIGNATURE-----
iQEcBAABCAAGBQJaY3/eAAoJEED/6hsPKofo64kH/16SCSA9pKJTf39+jLoCPzbp
tlhzxoaqb9cPNMQBAk8Cj5xNJ6V4Clwnk8iRWaE6dRI5nWQxnxRHiWxnrobHwUbK
I0zSy+SywynSBnollKzLzQrDUBZ72fv3oLwiYEYhjMvs0zW6Q/vg10WERbav912Q
bv8nb5e8TbvU500ErndKTXOa8/B6uZYkMVjBNvAHwb+4AQ7bJgDQs5/qOeXllm8A
MT/SNYop/fkjRP7mQng5XYzoO+70tbe0hWpOQGgBnduzrbkNNvZtYtovusHYytLX
PAB7DDPbLZm5L2HBo4zvKgTHIoHTxU0X2yfUDzt7O151O2WSyqBRC3y1tpj6xa8=
=GnNJ
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Radim Krčmář:
"ARM:
- fix incorrect huge page mappings on systems using the contiguous
hint for hugetlbfs
- support alternative GICv4 init sequence
- correctly implement the ARM SMCC for HVC and SMC handling
PPC:
- add KVM IOCTL for reporting vulnerability and workaround status
s390:
- provide userspace interface for branch prediction changes in
firmware
x86:
- use correct macros for bits"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: s390: wire up bpb feature
KVM: PPC: Book3S: Provide information about hardware/firmware CVE workarounds
KVM/x86: Fix wrong macro references of X86_CR0_PG_BIT and X86_CR4_PAE_BIT in kvm_valid_sregs()
arm64: KVM: Fix SMCCC handling of unimplemented SMC/HVC calls
KVM: arm64: Fix GICv4 init when called from vgic_its_create
KVM: arm/arm64: Check pagesize when allocating a hugepage at Stage 2
The new firmware interfaces for branch prediction behaviour changes
are transparently available for the guest. Nevertheless, there is
new state attached that should be migrated and properly resetted.
Provide a mechanism for handling reset, migration and VSIE.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
[Changed capability number to 152. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: Guillaume Nault <g.nault@alphalink.fr>
Tested-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows deletion of objects via unique handle which can be
listed via '-a' option.
Signed-off-by: Harsha Sharma <harshasharmaiitr@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This adds a new ioctl, KVM_PPC_GET_CPU_CHAR, that gives userspace
information about the underlying machine's level of vulnerability
to the recently announced vulnerabilities CVE-2017-5715,
CVE-2017-5753 and CVE-2017-5754, and whether the machine provides
instructions to assist software to work around the vulnerabilities.
The ioctl returns two u64 words describing characteristics of the
CPU and required software behaviour respectively, plus two mask
words which indicate which bits have been filled in by the kernel,
for extensibility. The bit definitions are the same as for the
new H_GET_CPU_CHARACTERISTICS hypercall.
There is also a new capability, KVM_CAP_PPC_GET_CPU_CHAR, which
indicates whether the new ioctl is available.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Tell user space about device on which the map was created.
Unfortunate reality of user ABI makes sharing this code
with program offload difficult but the information is the
same.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Doc BPF ld/ldx size defines as comments in code, as it makes in
faster to lookup in a programming/review setting, than looking up
the sizes in Documentation/networking/filter.txt.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEE4bay/IylYqM/npjQHv7KIOw4HPYFAlpeCCQTHG1rbEBwZW5n
dXRyb25peC5kZQAKCRAe/sog7Dgc9ti9B/4jAoCYTuYXnkRvS34jdgQQV99QyC8M
EpcAgzHo2kYNPDf5q8TEVheXxvA9XeGiA+TtjL9cNowxwMMtJev3/dmBtOmU6jek
RgOsYlR8guxBHOx8pj1IMl1xPoWTLfg4Kv1qnpXx3zvhOP610G/edBBSZt659MGF
SW6pBoNivbl+EYSM5x81QIfqJ4NlD5AKY4PeeSnGrnSthd8EFNp2zKkcY8nMOJ0D
Gm2YyxGJXh+lHse965DQRNg+owZxIWyheQplulVrw9v34LOjbFko3Cd+D9KLW5MG
LVVRJ3E0jm7W75AvxNcv2WP+lZVcDxXqsxFH0dP8WOJNZiKZeJ5aZfji
=ROXk
-----END PGP SIGNATURE-----
Merge tag 'linux-can-next-for-4.16-20180116' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:
====================
pull-request: can-next 2018-01-16
this is a pull request for net-next/master consisting of 9 patches.
This is a series of patches, some of them initially by Franklin S Cooper
Jr, which was picked up by Faiz Abbas. Faiz Abbas added some patches
while working on this series, I contributed one as well.
The first two patches add support to CAN device infrastructure to limit
the bitrate of a CAN adapter if the used CAN-transceiver has a certain
maximum bitrate.
The remaining patches improve the m_can driver. They add support for
bitrate limiting to the driver, clean up the driver and add support for
runtime PM.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows userspace to attach eBPF filter to tun. This will
allow to implement VM dataplane filtering in a more efficient way
compared to cBPF filter by allowing either qemu or libvirt to
attach eBPF filter to tun.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce two new attributes to be used for qdisc creation and dumping.
One for ingress block, one for egress block. Introduce a set of ops that
qdisc which supports block sharing would implement.
Passing block indexes in qdisc change is not supported yet and it is
checked and forbidded.
In future, these attributes are to be reused for specifying block
indexes for classes as well. As of this moment however, it is not
supported so a check is in place to forbid it.
Suggested-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As the tcm_ifindex with value TCM_IFINDEX_MAGIC_BLOCK is invalid ifindex,
use it to indicate that we work with block, instead of qdisc.
So if tcm_ifindex is set to TCM_IFINDEX_MAGIC_BLOCK, tcm_parent is used
to carry block_index.
If the block is set to be shared between at least 2 qdiscs, it is
forbidden to use the qdisc handle to add/delete filters. In that case,
userspace has to pass block_index.
Also, for dump of the filters, in case the block is shared in between at
least 2 qdiscs, the each filter is dumped with tcm_ifindex value
TCM_IFINDEX_MAGIC_BLOCK and tcm_parent set to block_index. That gives
the user clear indication, that the filter belongs to a shared block
and not only to one qdisc under which it is dumped.
Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-01-17
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Add initial BPF map offloading for nfp driver. Currently only
programs were supported so far w/o being able to access maps.
Offloaded programs are right now only allowed to perform map
lookups, and control path is responsible for populating the
maps. BPF core infrastructure along with nfp implementation is
provided, from Jakub.
2) Various follow-ups to Josef's BPF error injections. More
specifically that includes: properly check whether the error
injectable event is on function entry or not, remove the percpu
bpf_kprobe_override and rather compare instruction pointer
with original one, separate error-injection from kprobes since
it's not limited to it, add injectable error types in order to
specify what is the expected type of failure, and last but not
least also support the kernel's fault injection framework, all
from Masami.
3) Various misc improvements and cleanups to the libbpf Makefile.
That is, fix permissions when installing BPF header files, remove
unused variables and functions, and also install the libbpf.h
header, from Jesper.
4) When offloading to nfp JIT and the BPF insn is unsupported in the
JIT, then reject right at verification time. Also fix libbpf with
regards to ELF section name matching by properly treating the
program type as prefix. Both from Quentin.
5) Add -DPACKAGE to bpftool when including bfd.h for the disassembler.
This is needed, for example, when building libfd from source as
bpftool doesn't supply a config.h for bfd.h. Fix from Jiong.
6) xdp_convert_ctx_access() is simplified since it doesn't need to
set target size during verification, from Jesper.
7) Let bpftool properly recognize BPF_PROG_TYPE_CGROUP_DEVICE
program types, from Roman.
8) Various functions in BPF cpumap were not declared static, from Wei.
9) Fix a double semicolon in BPF samples, from Luis.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The hardware processes which are modeled via dpipe commonly use some
internal hardware resources. Such relation can improve the understanding
of hardware limitations. The number of resource's unit consumed per
table's entry are also provided for each table.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for performing driver hot reload.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for hardware resource abstraction over devlink. Each resource
is identified via id, furthermore it contains information regarding its
size and its related sub resources. Each resource can also provide its
current occupancy.
In some cases the sizes of some resources can be changed, yet for those
changes to take place a hot driver reload may be needed. The reload
capability will be introduced in the next patch.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Various CAN or CAN-FD IP may be able to run at a faster rate than
what the transceiver the CAN node is connected to. This can lead to
unexpected errors. However, CAN transceivers typically have fixed
limitations and provide no means to discover these limitations at
runtime. Therefore, add support for a can-transceiver node that
can be reused by other CAN peripheral drivers to determine for both
CAN and CAN-FD what the max bitrate that can be used. If the user
tries to configure CAN to pass these maximum bitrates it will throw
an error.
Also add support for reading bitrate_max via the netlink interface.
Reviewed-by: Suman Anna <s-anna@ti.com>
Signed-off-by: Franklin S Cooper Jr <fcooper@ti.com>
[nsekhar@ti.com: fix build error with !CONFIG_OF]
Signed-off-by: Sekhar Nori <nsekhar@ti.com>
Signed-off-by: Faiz Abbas <faiz_abbas@ti.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This reverts commit ceaa001a17.
The OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS attr should be designed
as a nested attribute to support all ERSPAN v1 and v2's fields.
The current attr is a be32 supporting only one field. Thus, this
patch reverts it and later patch will redo it using nested attr.
Signed-off-by: William Tu <u9012063@gmail.com>
Cc: Jiri Benc <jbenc@redhat.com>
Cc: Pravin Shelar <pshelar@ovn.org>
Acked-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
BPF map offload follow similar path to program offload. At creation
time users may specify ifindex of the device on which they want to
create the map. Map will be validated by the kernel's
.map_alloc_check callback and device driver will be called for the
actual allocation. Map will have an empty set of operations
associated with it (save for alloc and free callbacks). The real
device callbacks are kept in map->offload->dev_ops because they
have slightly different signatures. Map operations are called in
process context so the driver may communicate with HW freely,
msleep(), wait() etc.
Map alloc and free callbacks are muxed via existing .ndo_bpf, and
are always called with rtnl lock held. Maps and programs are
guaranteed to be destroyed before .ndo_uninit (i.e. before
unregister_netdev() returns). Map callbacks are invoked with
bpf_devs_lock *read* locked, drivers must take care of exclusive
locking if necessary.
All offload-specific branches are marked with unlikely() (through
bpf_map_is_dev_bound()), given that branch penalty will be
negligible compared to IO anyway, and we don't want to penalize
SW path unnecessarily.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
As pointed out by Daniel Borkmann, using bpf_target_off() is not
necessary for xdp_rxq_info when extracting queue_index and
ifindex, as these members are u32 like BPF_W.
Also fix trivial spelling mistake introduced in same commit.
Fixes: 02dd3291b2 ("bpf: finally expose xdp_rxq_info to XDP bpf-programs")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
conntrack defrag is needed only if some module like CONNTRACK or NAT
explicitly requests it. For plain forwarding scenarios, defrag is
not needed and can be skipped if NOTRACK is set in a rule.
Since conntrack defrag is currently higher priority than raw table,
setting NOTRACK is not sufficient. We need to move raw to a higher
priority for iptables only.
This is achieved by introducing a module parameter "raw_before_defrag"
which allows to change the priority of raw table to place it before
defrag. By default, the parameter is disabled and the priority of raw
table is NF_IP_PRI_RAW to support legacy behavior. If the module
parameter is enabled, then the priority of the raw table is set to
NF_IP_PRI_RAW_BEFORE_DEFRAG.
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Four patches from Or that add Hairpin support to mlx5:
===========================================================
From: Or Gerlitz <ogerlitz@mellanox.com>
We refer the ability of NIC HW to fwd packet received on one port to
the other port (also from a port to itself) as hairpin. The application API
is based
on ingress tc/flower rules set on the NIC with the mirred redirect
action. Other actions can apply to packets during the redirect.
Hairpin allows to offload the data-path of various SW DDoS gateways,
load-balancers, etc to HW. Packets go through all the required
processing in HW (header re-write, encap/decap, push/pop vlan) and
then forwarded, CPU stays at practically zero usage. HW Flow counters
are used by the control plane for monitoring and accounting.
Hairpin is implemented by pairing a receive queue (RQ) to send queue (SQ).
All the flows that share <recv NIC, mirred NIC> are redirected through
the same hairpin pair. Currently, only header-rewrite is supported as a
packet modification action.
I'd like to thanks Elijah Shakkour <elijahs@mellanox.com> for implementing this
functionality
on HW simulator, before it was avail in the FW so the driver code could be
tested early.
===========================================================
From Feras three patches that provide very small changes that allow IPoIB
to support RX timestamping for child interfaces, simply by hooking the mlx5e
timestamping PTP ioctl to IPoIB child interface netdev profile.
One patch from Gal to fix a spilling mistake.
Two patches from Eugenia adds drop counters to VF statistics
to be reported as part of VF statistics in netlink (iproute2) and
implemented them in mlx5 eswitch.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJaVF5WAAoJEEg/ir3gV/o+fRkH/0PxjwJRA3REqhi/H8HOdH9f
cBLrOzFdqTCYQWQFCLFbMQ/Zgoel3KglpJ0iQMjuVFfjMbybVXOe8FAEVdbWHnfL
C+2HRMe8dplKrsq5UkxJhbyKhFKhl2XeMFYWonw9dSM7Nz5DyowQ1y1r5SgMlMAv
t3mYAIa4kZHK18BjDoIsCoAXXwsHiztR2irMp5+DwataTGP7vC7AsrucDxLA/qFf
I3E15DZk9s1f53PUuY7CYnUnJfMMP3VJdxpyx4k6xt9J2IMuilF4YyD6wpAKsVQU
/LzRkWI9x/6QindffqlrACeeidimOeY4pC4txIhS5uXgFXulugDHq1/Ih1sgZS8=
=g5vr
-----END PGP SIGNATURE-----
Merge tag 'mlx5-updates-2018-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
mlx5-updates-2018-01-08
Four patches from Or that add Hairpin support to mlx5:
===========================================================
From: Or Gerlitz <ogerlitz@mellanox.com>
We refer the ability of NIC HW to fwd packet received on one port to
the other port (also from a port to itself) as hairpin. The application API
is based
on ingress tc/flower rules set on the NIC with the mirred redirect
action. Other actions can apply to packets during the redirect.
Hairpin allows to offload the data-path of various SW DDoS gateways,
load-balancers, etc to HW. Packets go through all the required
processing in HW (header re-write, encap/decap, push/pop vlan) and
then forwarded, CPU stays at practically zero usage. HW Flow counters
are used by the control plane for monitoring and accounting.
Hairpin is implemented by pairing a receive queue (RQ) to send queue (SQ).
All the flows that share <recv NIC, mirred NIC> are redirected through
the same hairpin pair. Currently, only header-rewrite is supported as a
packet modification action.
I'd like to thanks Elijah Shakkour <elijahs@mellanox.com> for implementing this
functionality
on HW simulator, before it was avail in the FW so the driver code could be
tested early.
===========================================================
From Feras three patches that provide very small changes that allow IPoIB
to support RX timestamping for child interfaces, simply by hooking the mlx5e
timestamping PTP ioctl to IPoIB child interface netdev profile.
One patch from Gal to fix a spilling mistake.
Two patches from Eugenia adds drop counters to VF statistics
to be reported as part of VF statistics in netlink (iproute2) and
implemented them in mlx5 eswitch.
Signed-off-by: David S. Miller <davem@davemloft.net>
It allows matching packets based on Segment Routing Header
(SRH) information.
The implementation considers revision 7 of the SRH draft.
https://tools.ietf.org/html/draft-ietf-6man-segment-routing-header-07
Currently supported match options include:
(1) Next Header
(2) Hdr Ext Len
(3) Segments Left
(4) Last Entry
(5) Tag value of SRH
Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When a member joins a group, it also indicates a binding scope. This
makes it possible to create both node local groups, invisible to other
nodes, as well as cluster global groups, visible everywhere.
In order to avoid that different members end up having permanently
differing views of group size and memberhip, we must inhibit locally
and globally bound members from joining the same group.
We do this by using the binding scope as an additional separator between
groups. I.e., a member must ignore all membership events from sockets
using a different scope than itself, and all lookups for message
destinations must require an exact match between the message's lookup
scope and the potential target's binding scope.
Apart from making it possible to create local groups using the same
identity on different nodes, a side effect of this is that it now also
becomes possible to create a cluster global group with the same identity
across the same nodes, without interfering with the local groups.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ability to set speed and duplex for virtio_net is useful in various
scenarios as described here:
16032be virtio_net: add ethtool support for set and get of settings
However, it would be nice to be able to set this from the hypervisor,
such that virtio_net doesn't require custom guest ethtool commands.
Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows
the hypervisor to export a linkspeed and duplex setting. The user can
subsequently overwrite it later if desired via: 'ethtool -s'.
Note that VIRTIO_NET_F_SPEED_DUPLEX is defined as bit 63, the intention
is that device feature bits are to grow down from bit 63, since the
transports are starting from bit 24 and growing up.
Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: virtio-dev@lists.oasis-open.org
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds support for the GCM-AES-256 cipher suite as specified in
IEEE 802.1AEbn-2011. The prepared cipher suite selection mechanism is used,
with GCM-AES-128 being the default cipher suite as defined in the standard.
Signed-off-by: Felix Walter <felix.walter@cloudandheat.com>
Cc: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Modern hardware can decide to drop packets going to/from a VF.
Add receive and transmit drop counters to be displayed at hypervisor
layer in iproute2 per VF statistics.
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for your
net-next tree:
1) Free hooks via call_rcu to speed up netns release path, from
Florian Westphal.
2) Reduce memory footprint of hook arrays, skip allocation if family is
not present - useful in case decnet support is not compiled built-in.
Patches from Florian Westphal.
3) Remove defensive check for malformed IPv4 - including ihl field - and
IPv6 headers in x_tables and nf_tables.
4) Add generic flow table offload infrastructure for nf_tables, this
includes the netlink control plane and support for IPv4, IPv6 and
mixed IPv4/IPv6 dataplanes. This comes with NAT support too. This
patchset adds the IPS_OFFLOAD conntrack status bit to indicate that
this flow has been offloaded.
5) Add secpath matching support for nf_tables, from Florian.
6) Save some code bytes in the fast path for the nf_tables netdev,
bridge and inet families.
7) Allow one single NAT hook per point and do not allow to register NAT
hooks in nf_tables before the conntrack hook, patches from Florian.
8) Seven patches to remove the struct nf_af_info abstraction, instead
we perform direct calls for IPv4 which is faster. IPv6 indirections
are still needed to avoid dependencies with the 'ipv6' module, but
these now reside in struct nf_ipv6_ops.
9) Seven patches to handle NFPROTO_INET from the Netfilter core,
hence we can remove specific code in nf_tables to handle this
pseudofamily.
10) No need for synchronize_net() call for nf_queue after conversion
to hook arrays. Also from Florian.
11) Call cond_resched_rcu() when dumping large sets in ipset to avoid
softlockup. Again from Florian.
12) Pass lockdep_nfnl_is_held() to rcu_dereference_protected(), patch
from Florian Westphal.
13) Fix matching of counters in ipset, from Jozsef Kadlecsik.
14) Missing nfnl lock protection in the ip_set_net_exit path, also
from Jozsef.
15) Move connlimit code that we can reuse from nf_tables into
nf_conncount, from Florian Westhal.
And asorted cleanups:
16) Get rid of nft_dereference(), it only has one single caller.
17) Add nft_set_is_anonymous() helper function.
18) Remove NF_ARP_FORWARD leftover chain definition in nf_tables_arp.
19) Remove unnecessary comments in nf_conntrack_h323_asn1.c
From Varsha Rao.
20) Remove useless parameters in frag_safe_skb_hp(), from Gao Feng.
21) Constify layer 4 conntrack protocol definitions, function
parameters to register/unregister these protocol trackers, and
timeouts. Patches from Florian Westphal.
22) Remove nlattr_size indirection, from Florian Westphal.
23) Add fall-through comments as -Wimplicit-fallthrough needs this,
from Gustavo A. R. Silva.
24) Use swap() macro to exchange values in ipset, patch from
Gustavo A. R. Silva.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The "offset" option has been removed by
commit 900631ee6a ("l2tp: remove configurable payload offset").
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Acked-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add new instruction for the nf_tables VM that allows us to specify what
flows are offloaded into a given flow table via name. This new
instruction creates the flow entry and adds it to the flow table.
Only established flows, ie. we have seen traffic in both directions, are
added to the flow table. You can still decide to offload entries at a
later stage via packet counting or checking the ct status in case you
want to offload assured conntracks.
This new extension depends on the conntrack subsystem.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch introduces a netlink control plane to create, delete and dump
flow tables. Flow tables are identified by name, this name is used from
rules to refer to an specific flow table. Flow tables use the rhashtable
class and a generic garbage collector to remove expired entries.
This also adds the infrastructure to add different flow table types, so
we can add one for each layer 3 protocol family.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This new bit tells us that the conntrack entry is owned by the flow
table offload infrastructure.
# cat /proc/net/nf_conntrack
ipv4 2 tcp 6 src=10.141.10.2 dst=147.75.205.195 sport=36392 dport=443 src=147.75.205.195 dst=192.168.2.195 sport=443 dport=36392 [OFFLOAD] mark=0 zone=0 use=2
Note the [OFFLOAD] tag in the listing.
The timer of such conntrack entries look like stopped from userspace.
In practise, to make sure the conntrack entry does not go away, the
conntrack timer is periodically set to an arbitrary large value that
gets refreshed on every iteration from the garbage collector, so it
never expires- and they display no internal state in the case of TCP
flows. This allows us to save a bitcheck from the packet path via
nf_ct_is_expired().
Conntrack entries that have been offloaded to the flow table
infrastructure cannot be deleted/flushed via ctnetlink. The flow table
infrastructure is also responsible for releasing this conntrack entry.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This allows to reuse xt_connlimit infrastructure from nf_tables.
The upcoming nf_tables frontend can just pass in an nftables register
as input key, this allows limiting by any nft-supported key, including
concatenations.
For xt_connlimit, pass in the zone and the ip/ipv6 address.
With help from Yi-Hung Wei.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The kernel already has defines for this, but they are in uapi exposed
headers.
Including these from netns.h causes build errors and also adds unneeded
dependencies on heads that we don't need.
So move these defines to netfilter_defs.h and place the uapi ones
in ifndef __KERNEL__ to keep them for userspace.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>