Relax the condition that overlayfs supports nfs export, to require
that i_ino is consistent with st_ino/d_ino.
It is enough to require that st_ino and d_ino are consistent.
This fixes the failure of xfstest generic/504, due to mismatch of
st_ino to inode number in the output of /proc/locks.
Fixes: 12574a9f4c ("ovl: consistent i_ino for non-samefs with xino")
Cc: <stable@vger.kernel.org> # v4.19
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Function mlx5_devlink_alloc is missing a void argument, add it
to clean up the non-ANSI function declaration.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Maxime Chevallier says:
====================
net: mvpp2: cls: Allow steering based on vlan tag
The PPv2 classifier can perform flow steering based on keys extracted
from the VLAN tag. This series adds support for using the vlan id and
the vlan prio as keys, using the ethtool interface.
Patch 1 is a preparatory patch that prevent false-positive matches,
using a dedicated lookup id for the RSS C2 lookup.
Patch 2 allows to separate the flows based on the header fields they
contain. The main goal is to be able to separate tagged traffic from
untagged traffic for flow steering, just as we already do for RSS.
Patch 3 solves an issue we have when extracting fields that aren't full
bytes, such as the vlan tag which is 12 bits wide, or the priority which
is 3 bits wide.
Finally, patch 4 adds support for steering based on both vlan id and
priority, extracted from the outermost tag.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit allows using the vlan Id and priority as parts of the key
for classification offload. These fields are extracted from the
outermost tag, if multiple tags are present.
Vlan Id and priority are considered as 2 different fields by the
classifier, however the fields are both appended in the Header Extracted
Key in the same layout as they are found in the tags. This means that
when steering only based on the prio, a 16-bit slot is still taken in
the HEK.
The classifier doesn't allow extracting the DEI bit from the tag, so we
explicitly prevent user from using this bit in the key.
This commit adds the vlan priotity as a compatible HEK field for
tagged traffic, meaning that we limit the possibility of extracting this
field only to the flows that contain tagged traffic.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The C2 TCAM used for classification uses a key (Header Extracted Key)
built by concatenating several fields extracted from the packet header.
After a lot of trial-and-error and some guess work, it seems the HEK is
right justified, with the first fields being stored in the MSB, then
concatenated up until the LSB.
Until now, this doesn't cause any issue since all HEK fields we use are
full bytes. However this is an issue for the upcoming VLAN id and pri
extraction, which aren't full bytes.
Rework the way we built that TCAM key, by changing the order in which we
append the fields.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The way we currently handle classification offload and RSS is by having
dedicated lookup sequences in the flow table, each being selected
depending on several fields being present in the packet header.
We need to make sure the classification operation we want to perform can
be done in each flow we want to insert it into. As an example,
classifying on VLAN tag can only be done on flows used for tagged
traffic.
This commit makes sure we don't insert rules in flows we aren't
compatible with.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When performing a TCAM lookup in the C2 engine, it's possible that
multiple entries match the packet. To make sure the correct entry match
when performing a lookup, the Flow Table can set a lookup type, which
will be used in the TCAM lookup, thus preventing such false-positives.
We need to make sure the RSS match doesn't interfere with other
classification lookups, hence we use a dedicated lookup_type for it.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yash Shah says:
====================
Add macb support for SiFive FU540-C000
On FU540, the management IP block is tightly coupled with the Cadence
MACB IP block. It manages many of the boundary signals from the MACB IP
This patchset controls the tx_clk input signal to the MACB IP. It
switches between the local TX clock (125MHz) and PHY TX clocks. This
is necessary to toggle between 1Gb and 100/10Mb speeds.
Future patches may add support for monitoring or controlling other IP
boundary signals.
This patchset is mostly based on work done by
Wesley Terpstra <wesley@sifive.com>
This patchset is based on Linux v5.2-rc1 and tested on HiFive Unleashed
board with additional board related patches needed for testing can be
found at dev/yashs/ethernet_v3 branch of:
https://github.com/yashshah7/riscv-linux.git
Change History:
V3:
- Revert "MACB_SIFIVE_FU540" config changes in Kconfig and driver code.
The driver does not depend on SiFive GPIO driver.
V2:
- Change compatible string from "cdns,fu540-macb" to "sifive,fu540-macb"
- Add "MACB_SIFIVE_FU540" in Kconfig to support SiFive FU540 in macb
driver. This is needed because on FU540, the macb driver depends on
SiFive GPIO driver.
- Avoid writing the result of a comparison to a register.
- Fix the issue of probe fail on reloading the module reported by:
Andreas Schwab <schwab@suse.de>
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The management IP block is tightly coupled with the Cadence MACB IP
block on the FU540, and manages many of the boundary signals from the
MACB IP. This patch only controls the tx_clk input signal to the MACB
IP. Future patches may add support for monitoring or controlling other
IP boundary signals.
Signed-off-by: Yash Shah <yash.shah@sifive.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the compatibility string documentation for SiFive FU540-C0000
interface.
On the FU540, this driver also needs to read and write registers in a
management IP block that monitors or drives boundary signals for the
GEMGXL IP block that are not directly mapped to GEMGXL registers.
Therefore, add additional range to "reg" property for SiFive GEMGXL
management IP registers.
Signed-off-by: Yash Shah <yash.shah@sifive.com>
Reviewed-by: Paul Walmsley <paul.walmsley@sifive.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xue Chaojing says:
====================
hinic: add rss support and rss parameters configuration
This series add rss support for HINIC driver and implement the ethtool
interface related to rss parameter configuration. user can use ethtool
configure rss parameters or show rss parameters.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support rss parameters with ethtool,
user can change hash key, hash indirection table, hash
function by ethtool -X, and show rss parameters by ethtool -x.
Signed-off-by: Xue Chaojing <xuechaojing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves ethtool code from hinic_main.c to hinic_ethtool.c
Signed-off-by: Xue Chaojing <xuechaojing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds rss support for the HINIC driver.
Signed-off-by: Xue Chaojing <xuechaojing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
1) Module autoload for masquerade and redirection does not work.
2) Leak in unqueued packets in nf_ct_frag6_queue(). Ignore duplicated
fragments, pretend they are placed into the queue. Patches from
Guillaume Nault.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, hvsock can enter into a state where epoll_wait on EPOLLOUT will
not return even when the hvsock socket is writable, under some race
condition. This can happen under the following sequence:
- fd = socket(hvsocket)
- fd_out = dup(fd)
- fd_in = dup(fd)
- start a writer thread that writes data to fd_out with a combination of
epoll_wait(fd_out, EPOLLOUT) and
- start a reader thread that reads data from fd_in with a combination of
epoll_wait(fd_in, EPOLLIN)
- On the host, there are two threads that are reading/writing data to the
hvsocket
stack:
hvs_stream_has_space
hvs_notify_poll_out
vsock_poll
sock_poll
ep_poll
Race condition:
check for epollout from ep_poll():
assume no writable space in the socket
hvs_stream_has_space() returns 0
check for epollin from ep_poll():
assume socket has some free space < HVS_PKT_LEN(HVS_SEND_BUF_SIZE)
hvs_stream_has_space() will clear the channel pending send size
host will not notify the guest because the pending send size has
been cleared and so the hvsocket will never mark the
socket writable
Now, the EPOLLOUT will never return even if the socket write buffer is
empty.
The fix is to set the pending size to the default size and never change it.
This way the host will always notify the guest whenever the writable space
is bigger than the pending size. The host is already optimized to *only*
notify the guest when the pending size threshold boundary is crossed and
not everytime.
This change also reduces the cpu usage somewhat since hv_stream_has_space()
is in the hotpath of send:
vsock_stream_sendmsg()->hv_stream_has_space()
Earlier hv_stream_has_space was setting/clearing the pending size on every
call.
Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes an issue where TX Timestamps are not arriving on the error queue
when UDP_SEGMENT CMSG type is combined with CMSG type SO_TIMESTAMPING.
This can be illustrated with an updated updgso_bench_tx program which
includes the '-T' option to test for this condition. It also introduces
the '-P' option which will call poll() before reading the error queue.
./udpgso_bench_tx -4ucTPv -S 1472 -l2 -D 172.16.120.18
poll timeout
udp tx: 0 MB/s 1 calls/s 1 msg/s
The "poll timeout" message above indicates that TX timestamp never
arrived.
This patch preserves tx_flags for the first UDP GSO segment. Only the
first segment is timestamped, even though in some cases there may be
benefital in timestamping both the first and last segment.
Factors in deciding on first segment timestamp only:
- Timestamping both first and last segmented is not feasible. Hardware
can only have one outstanding TS request at a time.
- Timestamping last segment may under report network latency of the
previous segments. Even though the doorbell is suppressed, the ring
producer counter has been incremented.
- Timestamping the first segment has the upside in that it reports
timestamps from the application's view, e.g. RTT.
- Timestamping the first segment has the downside that it may
underreport tx host network latency. It appears that we have to pick
one or the other. And possibly follow-up with a config flag to choose
behavior.
v2: Remove tests as noted by Willem de Bruijn <willemb@google.com>
Moving tests from net to net-next
v3: Update only relevant tx_flag bits as per
Willem de Bruijn <willemb@google.com>
v4: Update comments and commit message as per
Willem de Bruijn <willemb@google.com>
Fixes: ee80d1ebe5 ("udp: add udp gso")
Signed-off-by: Fred Klassen <fklassen@appneta.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski says:
====================
net: netem: fix issues with corrupting GSO frames
Corrupting GSO frames currently leads to crashes, due to skb use
after free. These stem from the skb list handling - the segmented
skbs come back on a list, and this list is not properly unlinked
before enqueuing the segments. Turns out this condition is made
very likely to occur because of another bug - in backlog accounting.
Segments are counted twice, which means qdisc's limit gets reached
leading to drops and making the use after free very likely to happen.
The bugs are fixed in order in which they were added to the tree.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Brendan reports that the use of netem's packet corruption capability
leads to strange crashes. This seems to be caused by
commit d66280b12b ("net: netem: use a list in addition to rbtree")
which uses skb->next pointer to construct a fast-path queue of
in-order skbs.
Packet corruption code has to invoke skb_gso_segment() in case
of skbs in need of GSO. skb_gso_segment() returns a list of
skbs. If next pointers of the skbs on that list do not get cleared
fast path list may point to freed skbs or skbs which are also on
the RB tree.
Let's say skb gets segmented into 3 frames:
A -> B -> C
A gets hooked to the t_head t_tail list by tfifo_enqueue(), but it's
next pointer didn't get cleared so we have:
h t
|/
A -> B -> C
Now if B and C get also get enqueued successfully all is fine, because
tfifo_enqueue() will overwrite the list in order. IOW:
Enqueue B:
h t
| |
A -> B C
Enqueue C:
h t
| |
A -> B -> C
But if B and C get reordered we may end up with:
h t RB tree
|/ |
A -> B -> C B
\
C
Or if they get dropped just:
h t
|/
A -> B -> C
where A and B are already freed.
To reproduce either limit has to be set low to cause freeing of
segs or reorders have to happen (due to delay jitter).
Note that we only have to mark the first segment as not on the
list, "finish_segs" handling of other frags already does that.
Another caveat is that qdisc_drop_all() still has to free all
segments correctly in case of drop of first segment, therefore
we re-link segs before calling it.
v2:
- re-link before drop, v1 was leaking non-first segs if limit
was hit at the first seg
- better commit message which lead to discovering the above :)
Reported-by: Brendan Galloway <brendan.galloway@netronome.com>
Fixes: d66280b12b ("net: netem: use a list in addition to rbtree")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When GSO frame has to be corrupted netem uses skb_gso_segment()
to produce the list of frames, and re-enqueues the segments one
by one. The backlog length has to be adjusted to account for
new frames.
The current calculation is incorrect, leading to wrong backlog
lengths in the parent qdisc (both bytes and packets), and
incorrect packet backlog count in netem itself.
Parent backlog goes negative, netem's packet backlog counts
all non-first segments twice (thus remaining non-zero even
after qdisc is emptied).
Move the variables used to count the adjustment into local
scope to make 100% sure they aren't used at any stage in
backports.
Fixes: 6071bd1aa1 ("netem: Segment GSO packets on enqueue")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the call to device_property_read_u32_array is not error checked
leading to potential garbage values in the delays array that are then used
in msleep delays. Add a sanity check to the property fetching.
Addresses-Coverity: ("Uninitialized scalar variable")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Left shifting the signed int value 1 by 31 bits has undefined behaviour
and the shift amount oq_no can be as much as 63. Fix this by using
BIT_ULL(oq_no) instead.
Addresses-Coverity: ("Bad shift operation")
Fixes: f21fb3ed36 ("Add support of Cavium Liquidio ethernet adapters")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
"git diff" says:
\ No newline at end of file
after modifying the file.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long says:
====================
net: fix quite a few dst_cache crashes reported by syzbot
There are two kinds of crashes reported many times by syzbot with no
reproducer. Call Traces are like:
BUG: KASAN: slab-out-of-bounds in rt_cache_valid+0x158/0x190
net/ipv4/route.c:1556
rt_cache_valid+0x158/0x190 net/ipv4/route.c:1556
__mkroute_output net/ipv4/route.c:2332 [inline]
ip_route_output_key_hash_rcu+0x819/0x2d50 net/ipv4/route.c:2564
ip_route_output_key_hash+0x1ef/0x360 net/ipv4/route.c:2393
__ip_route_output_key include/net/route.h:125 [inline]
ip_route_output_flow+0x28/0xc0 net/ipv4/route.c:2651
ip_route_output_key include/net/route.h:135 [inline]
...
or:
kasan: GPF could be caused by NULL-ptr deref or user memory access
RIP: 0010:dst_dev_put+0x24/0x290 net/core/dst.c:168
<IRQ>
rt_fibinfo_free_cpus net/ipv4/fib_semantics.c:200 [inline]
free_fib_info_rcu+0x2e1/0x490 net/ipv4/fib_semantics.c:217
__rcu_reclaim kernel/rcu/rcu.h:240 [inline]
rcu_do_batch kernel/rcu/tree.c:2437 [inline]
invoke_rcu_callbacks kernel/rcu/tree.c:2716 [inline]
rcu_process_callbacks+0x100a/0x1ac0 kernel/rcu/tree.c:2697
...
They were caused by the fib_nh_common percpu member 'nhc_pcpu_rth_output'
overwritten by another percpu variable 'dev->tstats' access overflow in
tipc udp media xmit path when counting packets on a non tunnel device.
The fix is to make udp tunnel work with no tunnel device by allowing not
to count packets on the tstats when the tunnel dev is NULL in Patches 1/3
and 2/3, then pass a NULL tunnel dev in tipc_udp_tunnel() in Patch 3/3.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
udp_tunnel(6)_xmit_skb() called by tipc_udp_xmit() expects a tunnel device
to count packets on dev->tstats, a perpcu variable. However, TIPC is using
udp tunnel with no tunnel device, and pass the lower dev, like veth device
that only initializes dev->lstats(a perpcu variable) when creating it.
Later iptunnel_xmit_stats() called by ip(6)tunnel_xmit() thinks the dev as
a tunnel device, and uses dev->tstats instead of dev->lstats. tstats' each
pointer points to a bigger struct than lstats, so when tstats->tx_bytes is
increased, other percpu variable's members could be overwritten.
syzbot has reported quite a few crashes due to fib_nh_common percpu member
'nhc_pcpu_rth_output' overwritten, call traces are like:
BUG: KASAN: slab-out-of-bounds in rt_cache_valid+0x158/0x190
net/ipv4/route.c:1556
rt_cache_valid+0x158/0x190 net/ipv4/route.c:1556
__mkroute_output net/ipv4/route.c:2332 [inline]
ip_route_output_key_hash_rcu+0x819/0x2d50 net/ipv4/route.c:2564
ip_route_output_key_hash+0x1ef/0x360 net/ipv4/route.c:2393
__ip_route_output_key include/net/route.h:125 [inline]
ip_route_output_flow+0x28/0xc0 net/ipv4/route.c:2651
ip_route_output_key include/net/route.h:135 [inline]
...
or:
kasan: GPF could be caused by NULL-ptr deref or user memory access
RIP: 0010:dst_dev_put+0x24/0x290 net/core/dst.c:168
<IRQ>
rt_fibinfo_free_cpus net/ipv4/fib_semantics.c:200 [inline]
free_fib_info_rcu+0x2e1/0x490 net/ipv4/fib_semantics.c:217
__rcu_reclaim kernel/rcu/rcu.h:240 [inline]
rcu_do_batch kernel/rcu/tree.c:2437 [inline]
invoke_rcu_callbacks kernel/rcu/tree.c:2716 [inline]
rcu_process_callbacks+0x100a/0x1ac0 kernel/rcu/tree.c:2697
...
The issue exists since tunnel stats update is moved to iptunnel_xmit by
Commit 039f50629b ("ip_tunnel: Move stats update to iptunnel_xmit()"),
and here to fix it by passing a NULL tunnel dev to udp_tunnel(6)_xmit_skb
so that the packets counting won't happen on dev->tstats.
Reported-by: syzbot+9d4c12bfd45a58738d0a@syzkaller.appspotmail.com
Reported-by: syzbot+a9e23ea2aa21044c2798@syzkaller.appspotmail.com
Reported-by: syzbot+c4c4b2bb358bb936ad7e@syzkaller.appspotmail.com
Reported-by: syzbot+0290d2290a607e035ba1@syzkaller.appspotmail.com
Reported-by: syzbot+a43d8d4e7e8a7a9e149e@syzkaller.appspotmail.com
Reported-by: syzbot+a47c5f4c6c00fc1ed16e@syzkaller.appspotmail.com
Fixes: 039f50629b ("ip_tunnel: Move stats update to iptunnel_xmit()")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A similar fix to Patch "ip_tunnel: allow not to count pkts on tstats by
setting skb's dev to NULL" is also needed by ip6_tunnel.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
iptunnel_xmit() works as a common function, also used by a udp tunnel
which doesn't have to have a tunnel device, like how TIPC works with
udp media.
In these cases, we should allow not to count pkts on dev's tstats, so
that udp tunnel can work with no tunnel device safely.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov says:
====================
v2->v3: fixed issues in backtracking pointed out by Andrii.
The next step is to add a lot more tests for backtracking.
v1->v2: addressed Andrii's feedback.
this patch set introduces verifier support for bounded loops and
adds several other improvements.
Ideally they would be introduced one at a time,
but to support bounded loop the verifier needs to 'step back'
in the patch 1. That patch introduces tracking of spill/fill
of constants through the stack. Though it's a useful feature
it hurts cilium tests.
Patch 3 introduces another feature by extending is_branch_taken
logic to 'if rX op rY' conditions. This feature is also
necessary to support bounded loops.
Then patch 4 adds support for the loops while adding
key heuristics with jmp_processed.
Introduction of parentage chain of verifier states in patch 4
allows patch 9 to add backtracking of precise scalar registers
which finally resolves degradation from patch 1.
The end result is much faster verifier for existing programs
and new support for loops.
See patch 8 for many kinds of loops that are now validated.
Patch 9 is the most tricky one and could be rewritten with
a different algorithm in the future.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Introduce precision tracking logic that
helps cilium programs the most:
old clang old clang new clang new clang
with all patches with all patches
bpf_lb-DLB_L3.o 1838 2283 1923 1863
bpf_lb-DLB_L4.o 3218 2657 3077 2468
bpf_lb-DUNKNOWN.o 1064 545 1062 544
bpf_lxc-DDROP_ALL.o 26935 23045 166729 22629
bpf_lxc-DUNKNOWN.o 34439 35240 174607 28805
bpf_netdev.o 9721 8753 8407 6801
bpf_overlay.o 6184 7901 5420 4754
bpf_lxc_jit.o 39389 50925 39389 50925
Consider code:
654: (85) call bpf_get_hash_recalc#34
655: (bf) r7 = r0
656: (15) if r8 == 0x0 goto pc+29
657: (bf) r2 = r10
658: (07) r2 += -48
659: (18) r1 = 0xffff8881e41e1b00
661: (85) call bpf_map_lookup_elem#1
662: (15) if r0 == 0x0 goto pc+23
663: (69) r1 = *(u16 *)(r0 +0)
664: (15) if r1 == 0x0 goto pc+21
665: (bf) r8 = r7
666: (57) r8 &= 65535
667: (bf) r2 = r8
668: (3f) r2 /= r1
669: (2f) r2 *= r1
670: (bf) r1 = r8
671: (1f) r1 -= r2
672: (57) r1 &= 255
673: (25) if r1 > 0x1e goto pc+12
R0=map_value(id=0,off=0,ks=20,vs=64,imm=0) R1_w=inv(id=0,umax_value=30,var_off=(0x0; 0x1f))
674: (67) r1 <<= 1
675: (0f) r0 += r1
At this point the verifier will notice that scalar R1 is used in map pointer adjustment.
R1 has to be precise for later operations on R0 to be validated properly.
The verifier will backtrack the above code in the following way:
last_idx 675 first_idx 664
regs=2 stack=0 before 675: (0f) r0 += r1 // started backtracking R1 regs=2 is a bitmask
regs=2 stack=0 before 674: (67) r1 <<= 1
regs=2 stack=0 before 673: (25) if r1 > 0x1e goto pc+12
regs=2 stack=0 before 672: (57) r1 &= 255
regs=2 stack=0 before 671: (1f) r1 -= r2 // now both R1 and R2 has to be precise -> regs=6 mask
regs=6 stack=0 before 670: (bf) r1 = r8 // after this insn R8 and R2 has to be precise
regs=104 stack=0 before 669: (2f) r2 *= r1 // after this one R8, R2, and R1
regs=106 stack=0 before 668: (3f) r2 /= r1
regs=106 stack=0 before 667: (bf) r2 = r8
regs=102 stack=0 before 666: (57) r8 &= 65535
regs=102 stack=0 before 665: (bf) r8 = r7
regs=82 stack=0 before 664: (15) if r1 == 0x0 goto pc+21
// this is the end of verifier state. The following regs will be marked precised:
R1_rw=invP(id=0,umax_value=65535,var_off=(0x0; 0xffff)) R7_rw=invP(id=0)
parent didn't have regs=82 stack=0 marks // so backtracking continues into parent state
last_idx 663 first_idx 655
regs=82 stack=0 before 663: (69) r1 = *(u16 *)(r0 +0) // R1 was assigned no need to track it further
regs=80 stack=0 before 662: (15) if r0 == 0x0 goto pc+23 // keep tracking R7
regs=80 stack=0 before 661: (85) call bpf_map_lookup_elem#1 // keep tracking R7
regs=80 stack=0 before 659: (18) r1 = 0xffff8881e41e1b00
regs=80 stack=0 before 658: (07) r2 += -48
regs=80 stack=0 before 657: (bf) r2 = r10
regs=80 stack=0 before 656: (15) if r8 == 0x0 goto pc+29
regs=80 stack=0 before 655: (bf) r7 = r0 // here the assignment into R7
// mark R0 to be precise:
R0_rw=invP(id=0)
parent didn't have regs=1 stack=0 marks // regs=1 -> tracking R0
last_idx 654 first_idx 644
regs=1 stack=0 before 654: (85) call bpf_get_hash_recalc#34 // and in the parent frame it was a return value
// nothing further to backtrack
Two scalar registers not marked precise are equivalent from state pruning point of view.
More details in the patch comments.
It doesn't support bpf2bpf calls yet and enabled for root only.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add a bunch of loop tests. Most of them are created by replacing
'#pragma unroll' with '#pragma clang loop unroll(disable)'
Several tests are artificially large:
/* partial unroll. llvm will unroll loop ~150 times.
* C loop count -> 600.
* Asm loop count -> 4.
* 16k insns in loop body.
* Total of 5 such loops. Total program size ~82k insns.
*/
"./pyperf600.o",
/* no unroll at all.
* C loop count -> 600.
* ASM loop count -> 600.
* ~110 insns in loop body.
* Total of 5 such loops. Total program size ~1500 insns.
*/
"./pyperf600_nounroll.o",
/* partial unroll. 19k insn in a loop.
* Total program size 20.8k insn.
* ~350k processed_insns
*/
"./strobemeta.o",
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This set of tests is a rewrite of Edward's earlier tests:
https://patchwork.ozlabs.org/patch/877221/
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The commit 7640ead939 partially resolved the issue of callees
incorrectly pruning the callers.
With introduction of bounded loops and jmps_processed heuristic
single verifier state may contain multiple branches and calls.
It's possible that new verifier state (for future pruning) will be
allocated inside callee. Then callee will exit (still within the same
verifier state). It will go back to the caller and there R6-R9 registers
will be read and will trigger mark_reg_read. But the reg->live for all frames
but the top frame is not set to LIVE_NONE. Hence mark_reg_read will fail
to propagate liveness into parent and future walking will incorrectly
conclude that the states are equivalent because LIVE_READ is not set.
In other words the rule for parent/live should be:
whenever register parentage chain is set the reg->live should be set to LIVE_NONE.
is_state_visited logic already follows this rule for spilled registers.
Fixes: 7640ead939 ("bpf: verifier: make sure callees don't prune with caller differences")
Fixes: f4d7e40a5b ("bpf: introduce function calls (verification)")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Allow the verifier to validate the loops by simulating their execution.
Exisiting programs have used '#pragma unroll' to unroll the loops
by the compiler. Instead let the verifier simulate all iterations
of the loop.
In order to do that introduce parentage chain of bpf_verifier_state and
'branches' counter for the number of branches left to explore.
See more detailed algorithm description in bpf_verifier.h
This algorithm borrows the key idea from Edward Cree approach:
https://patchwork.ozlabs.org/patch/877222/
Additional state pruning heuristics make such brute force loop walk
practical even for large loops.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch extends is_branch_taken() logic from JMP+K instructions
to JMP+X instructions.
Conditional branches are often done when src and dst registers
contain known scalars. In such case the verifier can follow
the branch that is going to be taken when program executes.
That speeds up the verification and is essential feature to support
bounded loops.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
fix tests that incorrectly assumed that the verifier
cannot track constants through stack.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Compilers often spill induction variables into the stack,
hence it is necessary for the verifier to track scalar values
of the registers through stack slots.
Also few bpf programs were incorrectly rejected in the past,
since the verifier was not able to track such constants while
they were used to compute offsets into packet headers.
Tracking constants through the stack significantly decreases
the chances of state pruning, since two different constants
are considered to be different by state equivalency.
End result that cilium tests suffer serious degradation in the number
of states processed and corresponding verification time increase.
before after
bpf_lb-DLB_L3.o 1838 6441
bpf_lb-DLB_L4.o 3218 5908
bpf_lb-DUNKNOWN.o 1064 1064
bpf_lxc-DDROP_ALL.o 26935 93790
bpf_lxc-DUNKNOWN.o 34439 123886
bpf_netdev.o 9721 31413
bpf_overlay.o 6184 18561
bpf_lxc_jit.o 39389 359445
After further debugging turned out that cillium progs are
getting hurt by clang due to the same constant tracking issue.
Newer clang generates better code by spilling less to the stack.
Instead it keeps more constants in the registers which
hurts state pruning since the verifier already tracks constants
in the registers:
old clang new clang
(no spill/fill tracking introduced by this patch)
bpf_lb-DLB_L3.o 1838 1923
bpf_lb-DLB_L4.o 3218 3077
bpf_lb-DUNKNOWN.o 1064 1062
bpf_lxc-DDROP_ALL.o 26935 166729
bpf_lxc-DUNKNOWN.o 34439 174607
bpf_netdev.o 9721 8407
bpf_overlay.o 6184 5420
bpf_lcx_jit.o 39389 39389
The final table is depressing:
old clang old clang new clang new clang
const spill/fill const spill/fill
bpf_lb-DLB_L3.o 1838 6441 1923 8128
bpf_lb-DLB_L4.o 3218 5908 3077 6707
bpf_lb-DUNKNOWN.o 1064 1064 1062 1062
bpf_lxc-DDROP_ALL.o 26935 93790 166729 380712
bpf_lxc-DUNKNOWN.o 34439 123886 174607 440652
bpf_netdev.o 9721 31413 8407 31904
bpf_overlay.o 6184 18561 5420 23569
bpf_lxc_jit.o 39389 359445 39389 359445
Tracking constants in the registers hurts state pruning already.
Adding tracking of constants through stack hurts pruning even more.
The later patch address this general constant tracking issue
with coarse/precise logic.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add const qualifiers to bpf_object/bpf_program/bpf_map arguments for
getter APIs. There is no need for them to not be const pointers.
Verified that
make -C tools/lib/bpf
make -C tools/testing/selftests/bpf
make -C tools/perf
all build without warnings.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Each function that manipulates the aa_ext struct should reset it's "pos"
member on failure. This ensures that, on failure, no changes are made to
the state of the aa_ext struct.
There are paths were elements are optional and the error path is
used to indicate the optional element is not present. This means
instead of just aborting on error the unpack stream can become
unsynchronized on optional elements, if using one of the affected
functions.
Cc: stable@vger.kernel.org
Fixes: 736ec752d9 ("AppArmor: policy routines for loading and unpacking policy")
Signed-off-by: Mike Salvatore <mike.salvatore@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
A packed AppArmor policy contains null-terminated tag strings that are read
by unpack_nameX(). However, unpack_nameX() uses string functions on them
without ensuring that they are actually null-terminated, potentially
leading to out-of-bounds accesses.
Make sure that the tag string is null-terminated before passing it to
strcmp().
Cc: stable@vger.kernel.org
Fixes: 736ec752d9 ("AppArmor: policy routines for loading and unpacking policy")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
While commit 11c236b89d ("apparmor: add a default null dfa") ensure
every profile has a policy.dfa it does not resize the policy.start[]
to have entries for every possible start value. Which means
PROFILE_MEDIATES is not safe to use on untrusted input. Unforunately
commit b9590ad4c4 ("apparmor: remove POLICY_MEDIATES_SAFE") did not
take into account the start value usage.
The input string in profile_query_cb() is user controlled and is not
properly checked to be within the limited start[] entries, even worse
it can't be as userspace policy is allowed to make us of entries types
the kernel does not know about. This mean usespace can currently cause
the kernel to access memory up to 240 entries beyond the start array
bounds.
Cc: stable@vger.kernel.org
Fixes: b9590ad4c4 ("apparmor: remove POLICY_MEDIATES_SAFE")
Signed-off-by: John Johansen <john.johansen@canonical.com>
When inserting a new mmap entry to the xarray we should check for
'mmap_page' overflow as it is limited to 32 bits.
Fixes: 40909f664d ("RDMA/efa: Add EFA verbs implementation")
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl0JE18ACgkQxWXV+ddt
WDskyw//fi4r3/D2lz7ZKHRz00o8h9gDyBw50y4FHXN44GrAWgM0dbi1l0VXRquy
A9Xy4rBoquDdUKkSIvMr3ff33eK+Y2O9oUBeXMcb4tfg8TicAMdyTHduDwka1Ljg
TXcupG8cWd7Boi9FfeSDDPQpV+NXxzL9VSy7uTMjMmmxWYWViMJo38ultsxUhoOY
ZVNY8LNT/F6i/4Us9D5ymzqn6uQWzfu2GXZT2I3Pq5Ps3PBSc7OJVhrPYE9FQ/jv
ptirddkoeOo6xQJ1Pb/UMPjkTZ5ct2Wy/lAvQPXiWf9FjjwR7zuSL1Xe3wzpUg/y
llENXp3Ps+oMzTF1XKid43yHlt0Swqzr+EIiNvLbSj5E5o5msKZ+jXQYHV8LHZfW
I12uizChQc3N11SrwC+gVou5oAMhTnJmHjlq96vWXaw0lcvX+yHMWP3w16OGM3Pc
9aN0ap6SgRBM6XZkXR65Rf4sIAnCm12hDrVdHNPJhz95W6PQZkPhMS0FWjrOAUYy
yqhrMqtY4ELNRBBBXJ4dq+k+l/I6lHkPbhPXVt0VekXdyKPr4o5WZLI1k3C2S14L
wXrqw6wTU4pgaD6LCqQ7NFtCZF3Zz+8L+MHbbK1LLlMFUcMFs/+cRrg7EKipOxxn
mm/mjOKD5lGUJPGKuFb87rCcgAOXcOWnF+FoR/FNf6HVEJJJOuY=
=uzhr
-----END PGP SIGNATURE-----
Merge tag 'for-5.2-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- regression where properties stored as xattrs are not properly
persisted
- a small readahead fix (the fstests testcase for that fix hangs on
unpatched kernel, so we'd like get it merged to ease future testing)
- fix a race during block group creation and deletion
* tag 'for-5.2-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix failure to persist compression property xattr deletion on fsync
btrfs: start readahead also in seed devices
Btrfs: fix race between block group removal and block group allocation
I've been bad at collecting fixes this release cycle, so this is
a fairly large batch that's been trickling in for a while.
It's the usual mix, more or less:
Some of the bigger things fixed:
- Voltage fix for MMC on TI DRA7 that sometimes would overvoltage cards
- Regression fixes for D_CAN on am355x
- i.MX6SX cpuidle fix to deal with wakeup latency (dropped uart chars)
- DT fixes for some DRA7 variants that don't share the superset of
blocks on the chip
+ The usual mix of stuff -- minor build/warning fixes, Kconfig
dependencies, and some DT fixlets.
-----BEGIN PGP SIGNATURE-----
iQJDBAABCAAtFiEElf+HevZ4QCAJmMQ+jBrnPN6EHHcFAl0I67gPHG9sb2ZAbGl4
b20ubmV0AAoJEIwa5zzehBx3Hf4QAKKdaDz3V0v/+znx13YZdOzW7xXsO8Znd2n8
L9o86PCh57Auo7/wdYPcNxM0f/HHabnJdFeHQ2d9CZjQu54xOH59G/zpqcdd3zo0
4DEucZmagZaXrrSjFbh8OTks8ogT+IbSrPHniRenKXc5MhD7Ewlfz3AUMeB3gU+t
Z1KhA49j5OA2nTSOc8EgjykE2YfejbPwujZ+VROYDpiczYlQb6/TO1JXMSsX/APC
2LMWlPFtm6NnmiQJEUV8CRl+3X2TBrjjbmOszN2Ra27XV10Te8FMNx5A4vAIH3N5
KDzfh/zNO6nNgvxdqlnPE7sFXfWs3Nqo62M52hxRoWJap4y83F1S87kgW2vJ6+gm
HaHo/zEsE3kz8er3Geeobuc7ldL0qEisWZodGJWQnCOfXXAota9xHe5CFiPCCWJ3
Bouvep7JTA2qrUiu+kW5R+I1v0e2oCLDkfE1f8uZkiv2cDbEagLJfORp4qwAvWs/
tgOFX6+HtfQr7ZRdgvfgyYJlsRMuP5IeZpUZ7bfsyIcJlG6dBEoqzs590BXNLuTt
4gxFrSBcVgkHtesusJKpqvMDBF+/xItXq1ggvQU2rpE1XPfIZy9Pv8sbEFsl8UDY
Abq+eFdt12bdAu2ZjFy+IXmp8KFEgW8w7U5rGDOpMyqzwAOErSw40zFYk1Y94MV2
ZHvsSNR3
=pAiT
-----END PGP SIGNATURE-----
Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC fixes from Olof Johansson:
"I've been bad at collecting fixes this release cycle, so this is a
fairly large batch that's been trickling in for a while.
It's the usual mix, more or less.
Some of the bigger things fixed:
- Voltage fix for MMC on TI DRA7 that sometimes would overvoltage
cards
- Regression fixes for D_CAN on am355x
- i.MX6SX cpuidle fix to deal with wakeup latency (dropped uart
chars)
- DT fixes for some DRA7 variants that don't share the superset of
blocks on the chip
plus the usual mix of stuff: minor build/warning fixes, Kconfig
dependencies, and some DT fixlets"
* tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (28 commits)
soc: ixp4xx: npe: Fix an IS_ERR() vs NULL check in probe
ARM: ixp4xx: include irqs.h where needed
ARM: ixp4xx: mark ixp4xx_irq_setup as __init
ARM: ixp4xx: don't select SERIAL_OF_PLATFORM
firmware: trusted_foundations: add ARMv7 dependency
MAINTAINERS: Change QCOM repo location
ARM: davinci: da8xx: specify dma_coherent_mask for lcdc
ARM: davinci: da850-evm: call regulator_has_full_constraints()
ARM: mvebu_v7_defconfig: fix Ethernet on Clearfog
ARM: dts: am335x phytec boards: Fix cd-gpios active level
ARM: dts: dra72x: Disable usb4_tm target module
arm64: arch_k3: Fix kconfig dependency warning
ARM: dts: Drop bogus CLKSEL for timer12 on dra7
MAINTAINERS: Update Stefan Wahren email address
ARM: dts: bcm: Add missing device_type = "memory" property
soc: bcm: brcmstb: biuctrl: Register writes require a barrier
soc: brcmstb: Fix error path for unsupported CPUs
ARM: dts: dra71x: Disable usb4_tm target module
ARM: dts: dra71x: Disable rtc target module
ARM: dts: dra76x: Disable usb4_tm target module
...
Currently after setting tap0 link up, the tun code wakes tx/rx waited
queues up in tun_net_open() when .ndo_open() is called, however the
IFF_UP flag has not been set yet. If there's already a wait queue, it
would fail to transmit when checking the IFF_UP flag in tun_sendmsg().
Then the saving vhost_poll_start() will add the wq into wqh until it
is waken up again. Although this works when IFF_UP flag has been set
when tun_chr_poll detects; this is not true if IFF_UP flag has not
been set at that time. Sadly the latter case is a fatal error, as
the wq will never be waken up in future unless later manually
setting link up on purpose.
Fix this by moving the wakeup process into the NETDEV_UP event
notifying process, this makes sure IFF_UP has been set before all
waited queues been waken up.
Signed-off-by: Fei Li <lifei.shirley@bytedance.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A previous attempt to shut up the uninitialized variable use
warning was apparently insufficient. When CONFIG_PROFILE_ANNOTATED_BRANCHES
is set, gcc-8 still warns, because the unlikely() check in DP_NOTICE()
causes it to no longer track the state of all variables correctly:
drivers/net/ethernet/qlogic/qed/qed_dev.c: In function 'qed_llh_set_ppfid_affinity':
drivers/net/ethernet/qlogic/qed/qed_dev.c:798:47: error: 'abs_ppfid' may be used uninitialized in this function [-Werror=maybe-uninitialized]
addr = NIG_REG_PPF_TO_ENGINE_SEL + abs_ppfid * 0x4;
~~~~~~~~~~^~~~~
This is not a nice workaround, but always initializing the output from
qed_llh_abs_ppfid() at least shuts up the false positive reliably.
Fixes: 79284adeb9 ("qed: Add llh ppfid interface and 100g support for offload protocols")
Fixes: 8e2ea3ea96 ("qed: Fix static checker warning")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Flexible array members should be denoted using [] instead of [0], else
gcc will not warn when they are no longer at the end of a struct.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: David S. Miller <davem@davemloft.net>