Saeed Mahameed says:
====================
net/mlx5e: Kconfig fixes for VxLAN
Reposting to net the build errors fixes posted by Arnd last week.
Originally Arnd posted those fixes to net-next, while the issue
is also seen in net. For net-next a different approach is required
for fixing the issue as VXLAN and Device Drivers are no longer
dependent, but there is no harm for those fixes to get into net-next.
Optionally, once net is merged into net-next we can
Revert "net/mlx5e: make VXLAN support conditional" as the
CONFIG_MLX5_CORE_EN_VXLAN will no longer be required.
Applied on top: 2889286585 ('mlxsw: spectrum: Add missing rollback in flood configuration')
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
VXLAN can be disabled at compile-time or it can be a loadable
module while mlx5 is built-in, which leads to a link error:
drivers/net/built-in.o: In function `mlx5e_create_netdev':
ntb_netdev.c:(.text+0x106de4): undefined reference to `vxlan_get_rx_port'
This avoids the link error and makes the vxlan code optional,
like the other ethernet drivers do as well.
Link: https://patchwork.ozlabs.org/patch/589296/
Fixes: b3f63c3d5e ("net/mlx5e: Add netdev support for VXLAN tunneling")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 69976fb104.
We cannot select VXLAN when IPv4 support is disabled, that just gives
us additional build errors, including:
warning: (MLX5_CORE_EN) selects VXLAN which has unmet direct dependencies (NETDEVICES && NET_CORE && INET)
In file included from ../drivers/net/vxlan.c:36:0:
include/net/udp_tunnel.h: In function 'udp_tunnel_handle_offloads':
include/net/udp_tunnel.h:112:9: error: implicit declaration of function 'iptunnel_handle_offloads' [-Werror=implicit-function-declaration]
return iptunnel_handle_offloads(skb, type);
^~~~~~~~~~~~~~~~~~~~~~~~
I'm sending a proper fix for the original bug in a separate patch.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sergei Shtylyov says:
====================
sh_eth: couple of software reset bit cleanups
Here's a set of 2 patches against DaveM's 'net-next.git' repo. We can save
on the repetitive chip reset code...
[1/2] sh_eth: call sh_eth_tsu_write() from sh_eth_chip_reset_giga()
[2/2] sh_eth: reuse sh_eth_chip_reset()
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
All the chip_reset() methods repeat the code writing to the ARSTR register
and delaying for 1 ms, so that we can reuse sh_eth_chip_reset() twice.
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sh_eth_chip_reset_giga() doesn't really need to use direct iowrite32() when
writing to the ARSTR register, it can use sh_eth_tsu_write() as all other
chip_reset() methods.
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that mdiobus_scan() doesn't return NULL on failure anymore, this driver
no longer needs to check for it...
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2016-05-07
Here are a few more Bluetooth patches for the 4.7 kernel:
- NULL pointer fix in hci_intel driver
- New Intel Bluetooth controller id in btusb driver
- Added device tree binding documentation for Marvel's bt-sd8xxx
- Platform specific wakeup interrupt support for btmrvl driver
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The MACsec standard mentions a key identifier for each key, but
doesn't specify anything about it, so I arbitrarily chose 64 bits.
IEEE 802.1X-2010 specifies MKA (MACsec Key Agreement), and defines the
key identifier to be 128 bits (96 bits "member identifier" + 32 bits
"key number").
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor tcp_skb_cb to create two overlaping areas to store
state for incoming or outgoing skbs based on comments by
Neal Cardwell to tcp_nv patch:
AFAICT this patch would not require an increase in the size of
sk_buff cb[] if it were to take advantage of the fact that the
tcp_skb_cb header.h4 and header.h6 fields are only used in the packet
reception code path, and this in_flight field is only used on the
transmit side.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using ifb+netem on ingress on SIT/IPIP/GRE traffic,
GRO packets are not properly processed.
Segmentation should not be forced, since ifb is already adding
quite a performance hit.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TC_ACT_STOLEN is used when ingress traffic is mirred/redirected
to say ifb.
Packet is not dropped, but consumed.
Only TC_ACT_SHOT is a clear indication something went wrong.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of custom approach re-use generic helpers to convert byte to hex
format.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In few places the term "ones-complement sum" was used but the actual
meaning is "the complement of the ones-complement sum".
Also, avoid enclosing long statements with underscore, to ease
readability.
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On small embedded routers, one wants to control maximal amount of
memory used by fq_codel, instead of controlling number of packets or
bytes, since GRO/TSO make these not practical.
Assuming skb->truesize is accurate, we have to keep track of
skb->truesize sum for skbs in queue.
This patch adds a new TCA_FQ_CODEL_MEMORY_LIMIT attribute.
I chose a default value of 32 MBytes, which looks reasonable even
for heavy duty usages. (Prior fq_codel users should not be hurt
when they upgrade their kernels)
Two fields are added to tc_fq_codel_qd_stats to report :
- Current memory usage
- Number of drops caused by memory limits
# tc qd replace dev eth1 root est 1sec 4sec fq_codel memory_limit 4M
..
# tc -s -d qd sh dev eth1
qdisc fq_codel 8008: root refcnt 257 limit 10240p flows 1024
quantum 1514 target 5.0ms interval 100.0ms memory_limit 4Mb ecn
Sent 2083566791363 bytes 1376214889 pkt (dropped 4994406, overlimits 0
requeues 21705223)
rate 9841Mbit 812549pps backlog 3906120b 376p requeues 21705223
maxpacket 68130 drop_overlimit 4994406 new_flow_count 28855414
ecn_mark 0 memory_used 4190048 drop_overmemory 4994406
new_flows_len 1 old_flows_len 177
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Dave Täht <dave.taht@gmail.com>
Cc: Sebastian Möller <moeller0@gmx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an implementation of Qualcomm's IPC router protocol, used to
communicate with service providing remote processors.
Signed-off-by: Courtney Cavin <courtney.cavin@sonymobile.com>
Signed-off-by: Bjorn Andersson <bjorn.andersson@sonymobile.com>
[bjorn: Cope with 0 being a valid node id and implement RTM_NEWADDR]
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce compile stubs for the SMD API, allowing consumers to be
compile tested.
Acked-by: Andy Gross <andy.gross@linaro.org>
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If GSO packet is segmented and its segments are properly queued,
we call consume_skb() instead of kfree_skb() to be drop monitor
friendly.
Fixes: 3e4f8b7873 ("macvtap: Perform GSO on forwarding path.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vlad Yasevich <vyasevic@redhat.com>
Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
klogctl can fail and return -ve len, so check for this and
return NULL to avoid passing a (size_t)-1 to malloc.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
"data_split" was never set to false. It's just uninitialized.
Fixes: 2950219d87 ('qede: Add basic network device support')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The error handling is broken here. netxen_rom_fast_read() returns zero
on success and -EIO on error. It never returns -1.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
My static checker complains that we are using "autoneg" without
initializing it. The problem is the ->phy_read() condition is reversed
so we only set this on error instead of success.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
My static checker complained that "v" can be used unintialized if
netxen_rom_fast_read() returns -EIO. That function never actually
returns -1.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When cxgb4 is enabled with CONFIG_CHELSIO_T4_DCB set, VI enable command
gets called with DCB enabled. But when we have a back to back setup with
DCB enabled on one side and non-DCB on the Peer side. Firmware doesn't
send any DCB_L2_CFG, and DCB priority is never set for Tx queue.
But driver resets the queue priority and state machine whenever there
is a link down, this patch fixes it by adding a check to reset only if
cxgb4_dcb_enabled() returns true.
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Here are 3 small fixes for some driver problems that were reported.
Full details in the shortlog below.
All of these have been in linux-next with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlcuKUQACgkQMUfUDdst+yk8qgCguPDODcYzOWiH1+RtIXTH5kXG
/1EAoIx7+uhzX9pt9E635NsrcNJqefWx
=9Uhc
-----END PGP SIGNATURE-----
Merge tag 'char-misc-4.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull misc driver fixes from Gfreg KH:
"Here are three small fixes for some driver problems that were
reported. Full details in the shortlog below.
All of these have been in linux-next with no reported issues"
* tag 'char-misc-4.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
nvmem: mxs-ocotp: fix buffer overflow in read
Drivers: hv: vmbus: Fix signaling logic in hv_need_to_signal_on_read()
misc: mic: Fix for double fetch security bug in VOP driver
Well, it's really just IIO drivers here, some small fixes that resolve
some "crash on boot" errors that have shown up in the -rc series, and
other bugfixes that are required.
All have been in linux-next with no reported problems.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlcuKBgACgkQMUfUDdst+ynALgCgh4QWOZ/vnLQvx/r1ZOIW1xqm
QdAAn1eJ1/KzwbM+WmmfXu1dKykfs7YS
=tGM3
-----END PGP SIGNATURE-----
Merge tag 'staging-4.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull IIO driver fixes from Grek KH:
"It's really just IIO drivers here, some small fixes that resolve some
'crash on boot' errors that have shown up in the -rc series, and other
bugfixes that are required.
All have been in linux-next with no reported problems"
* tag 'staging-4.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
iio: imu: mpu6050: Fix name/chip_id when using ACPI
iio: imu: mpu6050: fix possible NULL dereferences
iio:adc:at91-sama5d2: Repair crash on module removal
iio: ak8975: fix maybe-uninitialized warning
iio: ak8975: Fix NULL pointer exception on early interrupt
Here are some last-remaining fixes for USB drivers to resolve issues
that have shown up in testing. And 2 new device ids as well.
All of these have been in linux-next with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlcuJqEACgkQMUfUDdst+yl07wCeMMXyn3ZgOgpxiAAFUjBiHN5P
86gAn2Kh/eihIFYwPRrHypbE67RO+yTx
=QYaR
-----END PGP SIGNATURE-----
Merge tag 'usb-4.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some last-remaining fixes for USB drivers to resolve issues
that have shown up in testing. And two new device ids as well.
All of these have been in linux-next with no reported issues"
* tag 'usb-4.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
Revert "USB / PM: Allow USB devices to remain runtime-suspended when sleeping"
usb: musb: jz4740: fix error check of usb_get_phy()
Revert "usb: musb: musb_host: Enable HCD_BH flag to handle urb return in bottom half"
usb: musb: gadget: nuke endpoint before setting its descriptor to NULL
USB: serial: cp210x: add Straizona Focusers device ids
USB: serial: cp210x: add ID for Link ECU
Pull ARM fixes from Russell King:
"These are a number of updates to fix a few problems found in the ARM
nommu code over the last couple of years, caused mostly by changes on
the mmu side"
* 'fixes' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: 8573/1: domain: move {set,get}_domain under config guard
ARM: 8572/1: nommu: change memory reserve for the vectors
ARM: 8571/1: nommu: fix PMSAv7 setup
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXLdt0AAoJEAhfPr2O5OEVMzYQAKGk2BqfPQO5+vV0S4FHG/5I
kBe2y5MhedikMTm3yp77jLqkUPO3zbuZ3BZW4IJbOhl0STYzT51C573Z+6utfSvr
b6zMvTfXPIe+iFoqO7fsO/zJuJtlFjm4wzO/S5v5mYrks5z3oLwN7v34QtkKfVM1
3pfwlXyIthE4Cff0j18SsyUpkjN3sF8lDJOFSjH2C1a9yfUpOAX9iPXax88S9fuc
cNyp6msrsa8qoYW4Z3IcMKzw7gCWXberaLYPv0Jy9zmk5OKLgJoUS3oKo0B29HSu
xxIhW60uZDv9+nQF623EUdIF7hRzSjTnnKK+qQFD+JX8c5qmIoPrQQW90WefFO2e
suBRvmg0xC6j0rQRjGbSdTm6mFjPeO6YHWcO3JpoQ7FvdYlRIq29H0PRYzr4bADD
IiJlxqMrBRX2l5T1fKw/Fx9fag4asl/svdYjw7fVYdrn9Q6dRDR/lj4dUsC7n0+J
lgKs9Lvs4ZE0n0TlAQEkgeL3Xe2RNuiO/90m3WEMFvT1YFX0BxW/w2RcCuAilt78
z24ljeeBrQWaR8wnsENoE+emwr1aa5hkTl5qgudiXBWO7obUiCbYkaFpFGU1DfEe
VE1JHBG82vbqTyYbLjLVpuLpp61qpXY5WngYjzG9gMetBAOaWkiPSIfq69h5c2JK
K+JEoMfJdrIpsqJ7k4E+
=0D5s
-----END PGP SIGNATURE-----
Merge tag 'media/v4.6-5' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab:
- deadlock fixes on driver probe at exynos4-is and s43-camif drivers
- a build breakage if media controller is enabled and USB or PCI is
built as module.
* tag 'media/v4.6-5' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
[media] media-device: fix builds when USB or PCI is compiled as module
[media] media: s3c-camif: fix deadlock on driver probe()
[media] media: exynos4-is: fix deadlock on driver probe
Pull libata fixes from Tejun Heo:
"An ahci driver addition and updates to ahci port enable handling for
some platform devices"
* 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
ata: add AMD Seattle platform driver
ARM: dts: apq8064: add ahci ports-implemented mask
ata: ahci-platform: Add ports-implemented DT bindings.
libahci: save port map for forced port map
When we fail to set the flooding configuration for the broadcast and
unregistered multicast traffic, we should revert the flooding
configuration of the unknown unicast traffic.
Fixes: 0293038e0c ("mlxsw: spectrum: Add support for flood control")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make the leave procedure in the error path symmetric to the join
procedure and first remove the port from the collector before
potentially destroying the LAG.
Fixes: 0d65fc1304 ("mlxsw: spectrum: Implement LAG port join/leave")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
UDP tunnel segmentation code relies on the inner offsets being set for
an UDP tunnel GSO packet, but the inner *_complete() functions will
set the inner offsets only if 'encapsulation' is set before calling
them. Currently, udp_gro_complete() sets 'encapsulation' only after
the inner *_complete() functions are done. This causes the inner
offsets having invalid values after udp_gro_complete() returns, which
in turn will make it impossible to properly segment the packet in case
it needs to be forwarded, which would be visible to the user either as
invalid packets being sent or as packet loss.
This patch fixes this by setting skb's 'encapsulation' in
udp_gro_complete() before calling into the inner complete functions,
and by making each possible UDP tunnel gro_complete() callback set the
inner_mac_header to the beginning of the tunnel payload.
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Reviewed-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The setting of the UDP tunnel GSO type is already performed by
udp[46]_gro_complete().
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When creating macvtaps that are expected to have the same ifindex
in different network namespaces, only the first one will succeed.
The others will fail with a sysfs_warn_dup warning due to them trying
to create the following sysfs link (with 'NN' the ifindex of macvtapX):
/sys/class/macvtap/tapNN -> /sys/devices/virtual/net/macvtapX/tapNN
This is reproducible by running the following commands:
ip netns add ns1
ip netns add ns2
ip link add veth0 type veth peer name veth1
ip link set veth0 netns ns1
ip link set veth1 netns ns2
ip netns exec ns1 ip l add link veth0 macvtap0 type macvtap
ip netns exec ns2 ip l add link veth1 macvtap1 type macvtap
The last command will fail with "RTNETLINK answers: File exists" (along
with the kernel warning) but retrying it will work because the ifindex
was incremented.
The 'net' device class is isolated between network namespaces so each
one has its own hierarchy of net devices.
This isn't the case for the 'macvtap' device class.
The problem occurs half-way through the netdev registration, when
`macvtap_device_event` is called-back to create the 'tapNN' macvtap
class device under the 'macvtapX' net class device.
This patch adds namespace support to the 'macvtap' device class so
that /sys/class/macvtap is no longer shared between net namespaces.
However, making the macvtap sysfs class namespace-aware has the side
effect of changing /sys/devices/virtual/net/macvtapX/tapNN into
/sys/devices/virtual/net/macvtapX/macvtap/tapNN.
This is due to Commit 24b1442 ("Driver-core: Always create class
directories for classses that support namespaces") and the fact that
class devices supporting namespaces are really not supposed to be placed
directly under other class devices.
To avoid breaking userland, a tapNN symlink pointing to macvtap/tapNN is
created inside the macvtapX directory.
Signed-off-by: Marc Angel <marc@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull writeback fix from Jens Axboe:
"Just a single fix for domain aware writeback, fixing a regression that
can cause balance_dirty_pages() to keep looping while not getting any
work done"
* 'for-linus' of git://git.kernel.dk/linux-block:
writeback: Fix performance regression in wb_over_bg_thresh()
I forgot that ip_send_unicast_reply() is not BH safe (yet).
Disabling preemption before calling it was not a good move.
Fixes: c10d9310ed ("tcp: do not assume TCP code is non preemptible")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andres Lagar-Cavilla <andreslc@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov says:
====================
bpf: introduce direct packet access
This set of patches introduce 'direct packet access' from
cls_bpf and act_bpf programs (which are root only).
Current bpf programs use LD_ABS, LD_INS instructions which have
to do 'if (off < skb_headlen)' for every packet access.
It's ok for socket filters, but too slow for XDP, since single
LD_ABS insn consumes 3% of cpu. Therefore we have to amortize the cost
of length check over multiple packet accesses via direct access
to skb->data, data_end pointers.
The existing packet parser typically look like:
if (load_half(skb, offsetof(struct ethhdr, h_proto)) != ETH_P_IP)
return 0;
if (load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol)) != IPPROTO_UDP ||
load_byte(skb, ETH_HLEN) != 0x45)
return 0;
...
with 'direct packet access' the bpf program becomes:
void *data = (void *)(long)skb->data;
void *data_end = (void *)(long)skb->data_end;
struct eth_hdr *eth = data;
struct iphdr *iph = data + sizeof(*eth);
if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*udp) > data_end)
return 0;
if (eth->h_proto != htons(ETH_P_IP))
return 0;
if (iph->protocol != IPPROTO_UDP || iph->ihl != 5)
return 0;
...
which is more natural to write and significantly faster.
See patch 6 for performance tests:
21Mpps(old) vs 24Mpps(new) with just 5 loads.
For more complex parsers the performance gain is higher.
The other approach implemented in [1] was adding two new instructions
to interpreter and JITs and was too hard to use from llvm side.
The approach presented here doesn't need any instruction changes,
but the verifier has to work harder to check safety of the packet access.
Patch 1 prepares the code and Patch 2 adds new checks for direct
packet access and all of them are gated with 'env->allow_ptr_leaks'
which is true for root only.
Patch 3 improves search pruning for large programs.
Patch 4 wires in verifier's changes with net/core/filter side.
Patch 5 updates docs
Patches 6 and 7 add tests.
[1] https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/?h=ld_abs_dw
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
add few tests for "pointer to packet" logic of the verifier
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
parse_simple.c - packet parser exapmle with single length check that
filters out udp packets for port 9
parse_varlen.c - variable length parser that understand multiple vlan headers,
ipip, ipip6 and ip options to filter out udp or tcp packets on port 9.
The packet is parsed layer by layer with multitple length checks.
parse_ldabs.c - classic style of packet parsing using LD_ABS instruction.
Same functionality as parse_simple.
simple = 24.1Mpps per core
varlen = 22.7Mpps
ldabs = 21.4Mpps
Parser with LD_ABS instructions is slower than full direct access parser
which does more packet accesses and checks.
These examples demonstrate the choice bpf program authors can make between
flexibility of the parser vs speed.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
explain how verifier checks safety of packet access
and update email addresses.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
allow cls_bpf and act_bpf programs access skb->data and skb->data_end pointers.
The bpf helpers that change skb->data need to update data_end pointer as well.
The verifier checks that programs always reload data, data_end pointers
after calls to such bpf helpers.
We cannot add 'data_end' pointer to struct qdisc_skb_cb directly,
since it's embedded as-is by infiniband ipoib, so wrapper struct is needed.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
since UNKNOWN_VALUE type is weaker than CONST_IMM we can un-teach
verifier its recognition of constants in conditional branches
without affecting safety.
Ex:
if (reg == 123) {
.. here verifier was marking reg->type as CONST_IMM
instead keep reg as UNKNOWN_VALUE
}
Two verifier states with UNKNOWN_VALUE are equivalent, whereas
CONST_IMM_X != CONST_IMM_Y, since CONST_IMM is used for stack range
verification and other cases.
So help search pruning by marking registers as UNKNOWN_VALUE
where possible instead of CONST_IMM.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extended BPF carried over two instructions from classic to access
packet data: LD_ABS and LD_IND. They're highly optimized in JITs,
but due to their design they have to do length check for every access.
When BPF is processing 20M packets per second single LD_ABS after JIT
is consuming 3% cpu. Hence the need to optimize it further by amortizing
the cost of 'off < skb_headlen' over multiple packet accesses.
One option is to introduce two new eBPF instructions LD_ABS_DW and LD_IND_DW
with similar usage as skb_header_pointer().
The kernel part for interpreter and x64 JIT was implemented in [1], but such
new insns behave like old ld_abs and abort the program with 'return 0' if
access is beyond linear data. Such hidden control flow is hard to workaround
plus changing JITs and rolling out new llvm is incovenient.
Therefore allow cls_bpf/act_bpf program access skb->data directly:
int bpf_prog(struct __sk_buff *skb)
{
struct iphdr *ip;
if (skb->data + sizeof(struct iphdr) + ETH_HLEN > skb->data_end)
/* packet too small */
return 0;
ip = skb->data + ETH_HLEN;
/* access IP header fields with direct loads */
if (ip->version != 4 || ip->saddr == 0x7f000001)
return 1;
[...]
}
This solution avoids introduction of new instructions. llvm stays
the same and all JITs stay the same, but verifier has to work extra hard
to prove safety of the above program.
For XDP the direct store instructions can be allowed as well.
The skb->data is NET_IP_ALIGNED, so for common cases the verifier can check
the alignment. The complex packet parsers where packet pointer is adjusted
incrementally cannot be tracked for alignment, so allow byte access in such cases
and misaligned access on architectures that define efficient_unaligned_access
[1] https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/?h=ld_abs_dw
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
cleanup verifier code and prepare it for addition of "pointer to packet" logic
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull x86 fixes from Ingo Molnar:
"This contains two fixes: a boot fix for older SGI/UV systems, and an
APIC calibration fix"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tsc: Read all ratio bits from MSR_PLATFORM_INFO
x86/platform/UV: Bring back the call to map_low_mmrs in uv_system_init
Jeff Kirsher says:
====================
40GbE Intel Wired LAN Driver Updates 2016-05-05
This series contains updates to i40e and i40evf.
The theme behind this series is code reduction, yeah! Jesse provides
most of the changes starting with a refactor of the interpretation of
a tunnel which lets us start using the hardware's parsing. Removed
the packet split receive routine and ancillary code in preparation
for the Rx-refactor. The refactor of the receive routine,
aligns the receive routine with the one in ixgbe which was highly
optimized. The hardware supports a 16 byte descriptor for receive,
but the driver was never using it in production. There was no performance
benefit to the real driver of 16 byte descriptors, so drop a whole lot
of complexity while getting rid of the code. Fixed a bug where while
changing the number of descriptors using ethtool, the driver did not
test the limits of the system memory before permanently assuming it
would be able to get receive buffer memory.
Mitch fixes a memory leak of one page each time the driver is opened by
allocating the correct number of receive buffers and do not fiddle with
next_to_use in the VF driver.
Arnd Bergmann fixed a indentation issue by adding the appropriate
curly braces in i40e_vc_config_promiscuous_mode_msg().
Julia Lawall fixed an issue found by Coccinelle, where i40e_client_ops
structure can be const since it is never modified.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>