The ring_cookie is 64 bits wide which is much larger than can be used
for actual queue index values. So provide some helper routines to
pack a VF index into the cookie. This is useful to steer packets to
a VF ring without having to know the queue layout of the device.
CC: Alex Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
After commit 07f4c90062 ("tcp/dccp: try to not exhaust
ip_local_port_range in connect()") it is advised to have an even number
of ports described in /proc/sys/net/ipv4/ip_local_port_range
This means start/end values should have a different parity.
Let's warn sysadmins of this, so that they can update their settings
if they want to.
Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This vendor id will be used by network (vNIC), USB (xHCI),
SATA (AHCI), GPIO, I2C, MMC and maybe other drivers
for ThunderX SoC.
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: Aleksey Makarov <aleksey.makarov@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently always send fragments without DF bit set.
Thus, given following setup:
mtu1500 - mtu1500:1400 - mtu1400:1280 - mtu1280
A R1 R2 B
Where R1 and R2 run linux with netfilter defragmentation/conntrack
enabled, then if Host A sent a fragmented packet _with_ DF set to B, R1
will respond with icmp too big error if one of these fragments exceeded
1400 bytes.
However, if R1 receives fragment sizes 1200 and 100, it would
forward the reassembled packet without refragmenting, i.e.
R2 will send an icmp error in response to a packet that was never sent,
citing mtu that the original sender never exceeded.
The other minor issue is that a refragmentation on R1 will conceal the
MTU of R2-B since refragmentation does not set DF bit on the fragments.
This modifies ip_fragment so that we track largest fragment size seen
both for DF and non-DF packets, and set frag_max_size to the largest
value.
If the DF fragment size is larger or equal to the non-df one, we will
consider the packet a path mtu probe:
We set DF bit on the reassembled skb and also tag it with a new IPCB flag
to force refragmentation even if skb fits outdev mtu.
We will also set DF bit on each fragment in this case.
Joint work with Hannes Frederic Sowa.
Reported-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
RGMII interfaces come in 4 different flavors that the PHY library needs
to care about: regular RGMII (no delays), RGMII with either RX or TX
delay, and both. In order to avoid errors of checking only for one type
of RGMII interface and miss the 3 others, introduce a convenience
function which tests for all values.
Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If tcp ehash table is constrained to a very small number of buckets
(eg boot parameter thash_entries=128), then we can crash if spinlock
array has more entries.
While we are at it, un-inline inet_ehash_locks_alloc() and make
following changes :
- Budget 2 cache lines per cpu worth of 'spinlocks'
- Try to kmalloc() the array to avoid extra TLB pressure.
(Most servers at Google allocate 8192 bytes for this hash table)
- Get rid of various #ifdef
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6_select_ident() returns a 32bit value in network order.
Fixes: 286c2349f6 ("ipv6: Clean up ipv6_select_ident() and ip6_fragment()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the patch
'ipv6: Only create RTF_CACHE routes after encountering pmtu exception',
we need to compensate the performance hit (bouncing dst->__refcnt).
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch keeps track of the DST_NOCACHE routes in a list and replaces its
dev with loopback during the iface down/unregister event.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch always creates RTF_CACHE clone with DST_NOCACHE
when FLOWI_FLAG_KNOWN_NH is set so that the rt6i_dst is set to
the fl6->daddr.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Tested-by: Julian Anastasov <ja@ssi.bg>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of doing the rt6->rt6i_node check whenever we need
to get the route's cookie. Refactor it into rt6_get_cookie().
It is a prep work to handle FLOWI_FLAG_KNOWN_NH and also
percpu rt6_info later.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch creates a RTF_CACHE routes only after encountering a pmtu
exception.
After ip6_rt_update_pmtu() has inserted the RTF_CACHE route to the fib6
tree, the rt->rt6i_node->fn_sernum is bumped which will fail the
ip6_dst_check() and trigger a relookup.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
When creating a RTF_CACHE route, RTF_ANYCAST is set based on rt6i_dst.
Also, rt6i_gateway is always set to the nexthop while the nexthop
could be a gateway or the rt6i_dst.addr.
After removing the rt6i_dst and rt6i_src dependency in the last patch,
we also need to stop the caller from depending on rt6i_gateway and
RTF_ANYCAST.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes the assumptions that the returned rt is always
a RTF_CACHE entry with the rt6i_dst and rt6i_src containing the
destination and source address. The dst and src can be recovered from
the calling site.
We may consider to rename (rt6i_dst, rt6i_src) to
(rt6i_key_dst, rt6i_key_src) later.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes the ipv6_select_ident() signature to return a
fragment id instead of taking a whole frag_hdr as a param to
only set the frag_hdr->identification.
It also cleans up ip6_fragment() to obtain the fragment id at the
beginning instead of using multiple "if" later to check fragment id
has been generated or not.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prepare skb_splice_bits to be able to deal with AF_UNIX sockets.
AF_UNIX sockets don't use lock_sock/release_sock and thus we have to
use a callback to make the locking and unlocking configureable.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/cadence/macb.c
drivers/net/phy/phy.c
include/linux/skbuff.h
net/ipv4/tcp.c
net/switchdev/switchdev.c
Switchdev was a case of RTNH_H_{EXTERNAL --> OFFLOAD}
renaming overlapping with net-next changes of various
sorts.
phy.c was a case of two changes, one adding a local
variable to a function whilst the second was removing
one.
tcp.c overlapped a deadlock fix with the addition of new tcp_info
statistic values.
macb.c involved the addition of two zyncq device entries.
skbuff.h involved adding back ipv4_daddr to nf_bridge_info
whilst net-next changes put two other existing members of
that struct into a union.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Don't leak ipvs->sysctl_tbl, from Tommi Rentala.
2) Fix neighbour table entry leak in rocker driver, from Ying Xue.
3) Do not emit bonding notifications for unregistered interfaces, from
Nicolas Dichtel.
4) Set ipv6 flow label properly when in TIME_WAIT state, from Florent
Fourcot.
5) Fix regression in ipv6 multicast filter test, from Henning Rogge.
6) do_replace() in various footables netfilter modules is missing a
check for 0 counters in the datastructure provided by the user. Fix
from Dave Jones, and found with trinity.
7) Fix RCU bug in packet scheduler classifier module unloads, from
Daniel Borkmann.
8) Avoid deadlock in tcp_get_info() by using u64_sync. From Eric
Dumzaet.
9) Input packet processing can race with inetdev_destroy() teardown,
fix potential OOPS in ip_error() by explicitly testing whether the
inetdev is still attached. From Eric W Biederman.
10) MLDv2 parser in bridge multicast code breaks too early while
parsing. Fix from Thadeu Lima de Souza Cascardo.
11) Asking for settings on non-zero PHYID doesn't work because we do not
import the command structure from the user and use the PHYID
provided there. Fix from Arun Parameswaran.
12) Fix UDP checksums with IPV6 RAW sockets, from Vlad Yasevich.
13) Missing NF_TABLES depends for TPROXY etc can cause build failures,
fix from Florian Westphal.
14) Fix netfilter conntrack to handle RFC5961 challenge ACKs properly,
from Jesper Dangaard Brouer.
15) If netlink autobind retry fails, we have to reset the sockets portid
back to zero. From Herbert Xu.
16) VXLAN netns exit code unregisters using wrong device, from John W
Linville.
17) Add some USB device IDs to ath3k and btusb bluetooth drivers, from
Dmitry Tunin and Wen-chien Jesse Sung.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits)
bridge: fix lockdep splat
net: core: 'ethtool' issue with querying phy settings
bridge: fix parsing of MLDv2 reports
ARM: zynq: DT: Use the zynq binding with macb
net: macb: Disable half duplex gigabit on Zynq
net: macb: Document zynq gem dt binding
ipv4: fill in table id when replacing a route
cdc_ncm: Fix tx_bytes statistics
ipv4: Avoid crashing in ip_error
tcp: fix a potential deadlock in tcp_get_info()
net: sched: fix call_rcu() race on classifier module unloads
net: phy: Make sure phy_start() always re-enables the phy interrupts
ipv6: fix ECMP route replacement
ipv6: do not delete previously existing ECMP routes if add fails
Revert "netfilter: bridge: query conntrack about skb dnat"
netfilter: ensure number of counters is >0 in do_replace()
netfilter: nfnetlink_{log,queue}: Register pernet in first place
tcp: don't over-send F-RTO probes
tcp: only undo on partial ACKs in CA_Loss
net/ipv6/udp: Fix ipv6 multicast socket filter regression
...
Pull block fixes from Jens Axboe:
"Three small fixes that have been picked up the last few weeks.
Specifically:
- Fix a memory corruption issue in NVMe with malignant user
constructed request. From Christoph.
- Kill (now) unused blk_queue_bio(), dm was changed to not need this
anymore. From Mike Snitzer.
- Always use blk_schedule_flush_plug() from the io_schedule() path
when flushing a plug, fixing a !TASK_RUNNING warning with md. From
Shaohua"
* 'for-linus' of git://git.kernel.dk/linux-block:
sched: always use blk_schedule_flush_plug in io_schedule_out
nvme: fix kernel memory corruption with short INQUIRY buffers
block: remove export for blk_queue_bio
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contain Netfilter fixes for your net tree, they are:
1) Fix a race in nfnetlink_log and nfnetlink_queue that can lead to a crash.
This problem is due to wrong order in the per-net registration and netlink
socket events. Patch from Francesco Ruggeri.
2) Make sure that counters that userspace pass us are higher than 0 in all the
x_tables frontends. Discovered via Trinity, patch from Dave Jones.
3) Revert a patch for br_netfilter to rely on the conntrack status bits. This
breaks stateless IPv6 NAT transformations. Patch from Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Taking socket spinlock in tcp_get_info() can deadlock, as
inet_diag_dump_icsk() holds the &hashinfo->ehash_locks[i],
while packet processing can use the reverse locking order.
We could avoid this locking for TCP_LISTEN states, but lockdep would
certainly get confused as all TCP sockets share same lockdep classes.
[ 523.722504] ======================================================
[ 523.728706] [ INFO: possible circular locking dependency detected ]
[ 523.734990] 4.1.0-dbg-DEV #1676 Not tainted
[ 523.739202] -------------------------------------------------------
[ 523.745474] ss/18032 is trying to acquire lock:
[ 523.750002] (slock-AF_INET){+.-...}, at: [<ffffffff81669d44>] tcp_get_info+0x2c4/0x360
[ 523.758129]
[ 523.758129] but task is already holding lock:
[ 523.763968] (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [<ffffffff816bcb75>] inet_diag_dump_icsk+0x1d5/0x6c0
[ 523.774661]
[ 523.774661] which lock already depends on the new lock.
[ 523.774661]
[ 523.782850]
[ 523.782850] the existing dependency chain (in reverse order) is:
[ 523.790326]
-> #1 (&(&hashinfo->ehash_locks[i])->rlock){+.-...}:
[ 523.796599] [<ffffffff811126bb>] lock_acquire+0xbb/0x270
[ 523.802565] [<ffffffff816f5868>] _raw_spin_lock+0x38/0x50
[ 523.808628] [<ffffffff81665af8>] __inet_hash_nolisten+0x78/0x110
[ 523.815273] [<ffffffff816819db>] tcp_v4_syn_recv_sock+0x24b/0x350
[ 523.822067] [<ffffffff81684d41>] tcp_check_req+0x3c1/0x500
[ 523.828199] [<ffffffff81682d09>] tcp_v4_do_rcv+0x239/0x3d0
[ 523.834331] [<ffffffff816842fe>] tcp_v4_rcv+0xa8e/0xc10
[ 523.840202] [<ffffffff81658fa3>] ip_local_deliver_finish+0x133/0x3e0
[ 523.847214] [<ffffffff81659a9a>] ip_local_deliver+0xaa/0xc0
[ 523.853440] [<ffffffff816593b8>] ip_rcv_finish+0x168/0x5c0
[ 523.859624] [<ffffffff81659db7>] ip_rcv+0x307/0x420
Lets use u64_sync infrastructure instead. As a bonus, 64bit
arches get optimized, as these are nop for them.
Fixes: 0df48c26d8 ("tcp: add tcpi_bytes_acked to tcp_info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jamal points out that this header also contains kernel internal magic that
cannot be used from userspace for anything meaningful.
Lets remove what the kernel doesn't use anymore and wrap remainder with
__KERNEL__.
Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch tracks the total number of inbound and outbound segments on a
TCP socket. One may use this number to have an idea on connection
quality when compared against the retransmissions.
RFC4898 named these : tcpEStatsPerfSegsIn and tcpEStatsPerfSegsOut
These are a 32bit field each and can be fetched both from TCP_INFO
getsockopt() if one has a handle on a TCP socket, or from inet_diag
netlink facility (iproute2/ss patch will follow)
Note that tp->segs_out was placed near tp->snd_nxt for good data
locality and minimal performance impact, while tp->segs_in was placed
near tp->bytes_received for the same reason.
Join work with Eric Dumazet.
Note that received SYN are accounted on the listener, but sent SYNACK
are not accounted.
Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull HID fixes from Jiri Kosina:
"Bugfixes for HID subsystem that should go in 4.1. Important
highlights:
- the patch that extended support for HID++ protocol for TK820
touchpad turns out to be causing regressions due to firmware
issues; patch reverting back to basic support from Benjamin
Tissoires
- Wacom driver can oops for devices that report non-touch data on
touch interfaces. Fix from Ping Cheng
- gpiolib is not mandatory for i2c-hid, so the driver shouldn't fail
if gpiolib is not enabled. Fix from Mika Westerberg"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid:
HID: wacom: fix an Oops caused by wacom_wac_finger_count_touches
HID: usbhid: Add HID_QUIRK_NOGET for Aten DVI KVM switch
HID: hid-sensor-hub: Fix debug lock warning
Revert "HID: logitech-hidpp: support combo keyboard touchpad TK820"
HID: i2c-hid: Do not fail probing if gpiolib is not enabled
exception of a single locking fix in the core code. All driver fixes are
for code that was merged recently. The Samsung stuff is mostly fixes
around suspend/resume, the Qualcomm fixes are for invalid hardware
configuration data and the Silicon Labs patches are fixes following
their move away from platform_data to Device Tree.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJVW/1LAAoJEKI6nJvDJaTUuTEP/i0mykL1zAXOoUDAVOsULkLq
55MUnknjfoh4CHeLRru+XeBkGNDjLKPNg4y47NadrMvRsav3sqY9WR6HZasB3roa
uGanJZRr1nVqaD3l2ki8twc4FgNfeEaFfY018Z3dDLSYbnsq990hWQChgTGLxuDN
iz9fh8DdOALa9dp2qo604oUjVN9QU+rqRClA1d8JhtiEfeCkTjzUBO8pkJ5ZO8ve
sRgQ6TkY0u69rVTYspot1cwkPLL8gCiX+nTazakhKNxEg6W1q7PrrHtZsWms47P5
SpemNaSnwwRBy1wYnbZAxgsLOmg/g7seICN/uz/OyR9ZgzRJz66HeI9yPceaCbnz
9eNxGtsDWvE5uejoqIBqivFULTtuUdo5ZOe5Xwz/b35Hy0l79T5Ag7NmPNwkAqOd
WOSMuv581tnRpgHz3bG+PFknsjqzCKCoMfW3bjHIH56lZA9AMlJhKuV6g85XY3WA
WAF+QtBWqD76TMzcf3DMg9egGn2321jvs5/I8jRMa9xCgV0Ycam0DwzDZ6G8KAhK
nWWoUyQSJU0tElphfylqSFR+00anovbB/BORGwMhZ9gohDR4UCdtCK2FwOxRY7kg
SWU1ruDpFSfLMstpwU9Rb+9F0TtSkxWWPtk23bto83erYXzx0ezwhLIA/pEPi+mO
chOqdtgPvtfS3NeBwB/5
=yjsw
-----END PGP SIGNATURE-----
Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk fixes from Michael Turquette:
"The first set of clk fixes for 4.1 are all driver bugs, with the
exception of a single locking fix in the core code.
All driver fixes are for code that was merged recently. The Samsung
stuff is mostly fixes around suspend/resume, the Qualcomm fixes are
for invalid hardware configuration data and the Silicon Labs patches
are fixes following their move away from platform_data to Device Tree"
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: si5351: Do not pass struct clk in platform_data
clk: si5351: Mention clock-names in the binding documentation
clk: add missing lock when call clk_core_enable in clk_set_parent
clk: exynos5420: Restore GATE_BUS_TOP on suspend
clk: qcom: Fix MSM8916 gfx3d_clk_src configuration
clk: qcom: Fix MSM8916 venus divider value
clk: exynos5433: Fix wrong PMS value of exynos5433_pll_rates
clk: exynos5433: Fix wrong parent clock of sclk_apollo clock
clk: exynos5433: Fix CLK_PCLK_MONOTONIC_CNT clk register assignment
clk: exynos5433: Fix wrong offset of PCLK_MSCL_SECURE_SMMU_JPEG
clk: Use CONFIG_ARCH_EXYNOS instead of CONFIG_ARCH_EXYNOS5433
We no longer need bsocket atomic counter, as inet_csk_get_port()
calls bind_conflict() regardless of its value, after commit
2b05ad33e1 ("tcp: bind() fix autoselection to share ports")
This patch removes overhead of maintaining this counter and
double inet_csk_get_port() calls under pressure.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Marcelo Ricardo Leitner <mleitner@redhat.com>
Cc: Flavio Leitner <fbl@redhat.com>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 8e4d980ac2 ("tcp: fix behavior for epoll edge trigger")
we fixed a possible hang of TCP sockets under memory pressure,
by allowing sk_stream_alloc_skb() to use sk_forced_mem_schedule()
if no packet is in socket write queue.
It turns out there are other cases where we want to force memory
schedule :
tcp_fragment() & tso_fragment() need to split a big TSO packet into
two smaller ones. If we block here because of TCP memory pressure,
we can effectively block TCP socket from sending new data.
If no further ACK is coming, this hang would be definitive, and socket
has no chance to effectively reduce its memory usage.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit c055d5b03b.
There are two issues:
'dnat_took_place' made me think that this is related to
-j DNAT/MASQUERADE.
But thats only one part of the story. This is also relevant for SNAT
when we undo snat translation in reverse/reply direction.
Furthermore, I originally wanted to do this mainly to avoid
storing ipv6 addresses once we make DNAT/REDIRECT work
for ipv6 on bridges.
However, I forgot about SNPT/DNPT which is stateless.
So we can't escape storing address for ipv6 anyway. Might as
well do it for ipv4 too.
Reported-and-tested-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
ip_do_nat() function was removed prior to kernel 3.4. Remove the
unnecessary function prototype as well.
Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This work as a follow-up of commit f7b3bec6f5 ("net: allow setting ecn
via routing table") and adds RFC3168 section 6.1.1.1. fallback for outgoing
ECN connections. In other words, this work adds a retry with a non-ECN
setup SYN packet, as suggested from the RFC on the first timeout:
[...] A host that receives no reply to an ECN-setup SYN within the
normal SYN retransmission timeout interval MAY resend the SYN and
any subsequent SYN retransmissions with CWR and ECE cleared. [...]
Schematic client-side view when assuming the server is in tcp_ecn=2 mode,
that is, Linux default since 2009 via commit 255cac91c3 ("tcp: extend
ECN sysctl to allow server-side only ECN"):
1) Normal ECN-capable path:
SYN ECE CWR ----->
<----- SYN ACK ECE
ACK ----->
2) Path with broken middlebox, when client has fallback:
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
SYN ----->
<----- SYN ACK
ACK ----->
In case we would not have the fallback implemented, the middlebox drop
point would basically end up as:
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
In any case, it's rather a smaller percentage of sites where there would
occur such additional setup latency: it was found in end of 2014 that ~56%
of IPv4 and 65% of IPv6 servers of Alexa 1 million list would negotiate
ECN (aka tcp_ecn=2 default), 0.42% of these webservers will fail to connect
when trying to negotiate with ECN (tcp_ecn=1) due to timeouts, which the
fallback would mitigate with a slight latency trade-off. Recent related
paper on this topic:
Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth,
Gorry Fairhurst, and Richard Scheffenegger:
"Enabling Internet-Wide Deployment of Explicit Congestion Notification."
Proc. PAM 2015, New York.
http://ecn.ethz.ch/ecn-pam15.pdf
Thus, when net.ipv4.tcp_ecn=1 is being set, the patch will perform RFC3168,
section 6.1.1.1. fallback on timeout. For users explicitly not wanting this
which can be in DC use case, we add a net.ipv4.tcp_ecn_fallback knob that
allows for disabling the fallback.
tp->ecn_flags are not being cleared in tcp_ecn_clear_syn() on output, but
rather we let tcp_ecn_rcv_synack() take that over on input path in case a
SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not prevent
ECN being negotiated eventually in that case.
Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
Signed-off-by: Brian Trammell <trammell@tik.ee.ethz.ch>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Dave That <dave.taht@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A non-percpu VIRQ (e.g., VIRQ_CONSOLE) may be freed on a different
VCPU than it is bound to. This can result in a race between
handle_percpu_irq() and removing the action in __free_irq() because
handle_percpu_irq() does not take desc->lock. The interrupt handler
sees a NULL action and oopses.
Only use the percpu chip/handler for per-CPU VIRQs (like VIRQ_TIMER).
# cat /proc/interrupts | grep virq
40: 87246 0 xen-percpu-virq timer0
44: 0 0 xen-percpu-virq debug0
47: 0 20995 xen-percpu-virq timer1
51: 0 0 xen-percpu-virq debug1
69: 0 0 xen-dyn-virq xen-pcpu
74: 0 0 xen-dyn-virq mce
75: 29 0 xen-dyn-virq hvc_console
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Cc: <stable@vger.kernel.org>
tcp_illinois and upcoming tcp_cdg require 64bit alignment of
icsk_ca_priv
x86 does not care, but other architectures might.
Fixes: 05cbc0db03 ("ipv4: Create probe timer for tcp PMTU as per RFC4821")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Fan Du <fan.du@intel.com>
Acked-by: Fan Du <fan.du@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When bridge netfilter re-fragments an IP packet for output, all
packets that can not be re-fragmented to their original input size
should be silently discarded.
However, current bridge netfilter output path generates an ICMP packet
with 'size exceeded MTU' message for such packets, this is a bug.
This patch refactors the ip_fragment() API to allow two separate
use cases. The bridge netfilter user case will not
send ICMP, the routing output will, as before.
Signed-off-by: Andy Zhou <azhou@nicira.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Improve readability of skip ICMP for de-fragmentation expiration logic.
This change will also make the logic easier to maintain when the
following patches in this series are applied.
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next. Briefly
speaking, cleanups and minor fixes for ipset from Jozsef Kadlecsik and
Serget Popovich, more incremental updates to make br_netfilter a better
place from Florian Westphal, ARP support to the x_tables mark match /
target from and context Zhang Chunyu and the addition of context to know
that the x_tables runs through nft_compat. More specifically, they are:
1) Fix sparse warning in ipset/ip_set_hash_ipmark.c when fetching the
IPSET_ATTR_MARK netlink attribute, from Jozsef Kadlecsik.
2) Rename STREQ macro to STRNCMP in ipset, also from Jozsef.
3) Use skb->network_header to calculate the transport offset in
ip_set_get_ip{4,6}_port(). From Alexander Drozdov.
4) Reduce memory consumption per element due to size miscalculation,
this patch and follow up patches from Sergey Popovich.
5) Expand nomatch field from 1 bit to 8 bits to allow to simplify
mtype_data_reset_flags(), also from Sergey.
6) Small clean for ipset macro trickery.
7) Fix error reporting when both ip_set_get_hostipaddr4() and
ip_set_get_extensions() from per-set uadt functions.
8) Simplify IPSET_ATTR_PORT netlink attribute validation.
9) Introduce HOST_MASK instead of hardcoded 32 in ipset.
10) Return true/false instead of 0/1 in functions that return boolean
in the ipset code.
11) Validate maximum length of the IPSET_ATTR_COMMENT netlink attribute.
12) Allow to dereference from ext_*() ipset macros.
13) Get rid of incorrect definitions of HKEY_DATALEN.
14) Include linux/netfilter/ipset/ip_set.h in the x_tables set match.
15) Reduce nf_bridge_info size in br_netfilter, from Florian Westphal.
16) Release nf_bridge_info after POSTROUTING since this is only needed
from the physdev match, also from Florian.
17) Reduce size of ipset code by deinlining ip_set_put_extensions(),
from Denys Vlasenko.
18) Oneliner to add ARP support to the x_tables mark match/target, from
Zhang Chunyu.
19) Add context to know if the x_tables extension runs from nft_compat,
to address minor problems with three existing extensions.
20) Correct return value in several seqfile *_show() functions in the
netfilter tree, from Joe Perches.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The spinlock is used to protect netns_ids which is per net,
so there is no need to use a global spinlock.
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- introduce port fdb obj and generic switchdev_port_fdb_add/del/dump()
- use switchdev_port_fdb_add/del/dump in rocker/team/bonding ndo ops.
- add support for fdb obj in switchdev_port_obj_add/del/dump()
- switch rocker to implement fdb ops via switchdev_ops
v3: updated to sync with named union changes.
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce an optimized version of sk_under_memory_pressure()
for TCP. Our intent is to use it in fast paths.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We plan to use sk_forced_wmem_schedule() in input path as well,
so make it non static and rename it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_mem_reclaim_partial() goal is to ensure each socket has
one SK_MEM_QUANTUM forward allocation. This is needed both for
performance and better handling of memory pressure situations in
follow up patches.
SK_MEM_QUANTUM is currently a page, but might be reduced to 4096 bytes
as some arches have 64KB pages.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
First one in __skb_checksum_validate_complete() fixes the following
(and other callers)
make C=2 CF=-D__CHECK_ENDIAN__ net/ipv4/tcp_ipv4.o
CHECK net/ipv4/tcp_ipv4.c
include/linux/skbuff.h:3052:24: warning: incorrect type in return expression (different base types)
include/linux/skbuff.h:3052:24: expected restricted __sum16
include/linux/skbuff.h:3052:24: got int
Second is fixing gso_make_checksum() :
CHECK net/ipv4/gre_offload.c
include/linux/skbuff.h:3360:14: warning: incorrect type in assignment (different base types)
include/linux/skbuff.h:3360:14: expected unsigned short [unsigned] [usertype] csum
include/linux/skbuff.h:3360:14: got restricted __sum16
include/linux/skbuff.h:3365:16: warning: incorrect type in return expression (different base types)
include/linux/skbuff.h:3365:16: expected restricted __sum16
include/linux/skbuff.h:3365:16: got unsigned short [unsigned] [usertype] csum
Fixes: 5a21232983 ("net: Support for csum_bad in skbuff")
Fixes: 7e2b10c1e5 ("net: Support for multiple checksums with gso")
Signed-off-by: Eric Dumazet <edumazet@google.com>
CC: Tom Herbert <tom@herbertland.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Here's some TTY and serial driver fixes for reported issues. All of
these have been in linux-next successfully.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlVX4JMACgkQMUfUDdst+ymy/QCfSx/npI/WfRNlKBMHI20xwOaE
szQAoJKRxeF0d+2GCJ56OVbmqqjnG4IN
=FUtJ
-----END PGP SIGNATURE-----
Merge tag 'tty-4.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial fixes from Greg KH:
"Here's some TTY and serial driver fixes for reported issues.
All of these have been in linux-next successfully"
* tag 'tty-4.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
pty: Fix input race when closing
tty/n_gsm.c: fix a memory leak when gsmtty is removed
Revert "serial/amba-pl011: Leave the TX IRQ alone when the UART is not open"
serial: omap: Fix error handling in probe
earlycon: Revert log warnings
We currently have no limit on the number of elements in a hash table.
This is a problem because some users (tipc) set a ceiling on the
maximum table size and when that is reached the hash table may
degenerate. Others may encounter OOM when growing and if we allow
insertions when that happens the hash table perofrmance may also
suffer.
This patch adds a new paramater insecure_max_entries which becomes
the cap on the table. If unset it defaults to max_size * 2. If
it is also zero it means that there is no cap on the number of
elements in the table. However, the table will grow whenever the
utilisation hits 100% and if that growth fails, you will get ENOMEM
on insertion.
As allowing oversubscription is potentially dangerous, the name
contains the word insecure.
Note that the cap is not a hard limit. This is done for performance
reasons as enforcing a hard limit will result in use of atomic ops
that are heavier than the ones we currently use.
The reasoning is that we're only guarding against a gross over-
subscription of the table, rather than a small breach of the limit.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
The following patchset contains Netfilter fixes for your net tree, they are:
1) Fix a leak in IPVS, the sysctl table is not released accordingly when
destroying a netns, patch from Tommi Rantala.
2) Fix a build error when TPROXY and socket are built-in but IPv6 defrag is
compiled as module, from Florian Westphal.
3) Fix TCP tracket wrt. RFC5961 challenge ACK when in LAST_ACK state, patch
from Jesper Dangaard Brouer.
4) Fix a bogus WARN_ON() in nf_tables when deleting a set element that stores
a map, from Mirek Kratochvil.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull scheduler fixes from Ingo Molnar:
"Two fixes: a suspend/resume related regression fix, and an RT priority
boosting fix"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Fix regression in cpuset_cpu_inactive() for suspend
sched: Handle priority boosted tasks proper in setscheduler()