When performing foo-over-UDP, UDP packets are processed by the
encapsulation handler which returns another protocol to process.
This may result in processing two (or more) protocols in the
loop that are marked as INET6_PROTO_FINAL. The actions taken
for hitting a final protocol, in particular the skb_postpull_rcsum
can only be performed once.
This patch set adds a check of a final protocol has been seen. The
rules are:
- If the final protocol has not been seen any protocol is processed
(final and non-final). In the case of a final protocol, the final
actions are taken (like the skb_postpull_rcsum)
- If a final protocol has been seen (e.g. an encapsulating UDP
header) then no further non-final protocols are allowed
(e.g. extension headers). For more final protocols the
final actions are not taken (e.g. skb_postpull_rcsum).
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In ip6_input_finish the nexthdr protocol is retrieved from the
next header offset that is returned in the cb of the skb.
This method does not work for UDP encapsulation that may not
even have a concept of a nexthdr field (e.g. FOU).
This patch checks for a final protocol (INET6_PROTO_FINAL) when a
protocol handler returns > 0. If the protocol is not final then
resubmission is performed on nhoff value. If the protocol is final
then the nexthdr is taken to be the return value.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch defines two new GSO definitions SKB_GSO_IPXIP4 and
SKB_GSO_IPXIP6 along with corresponding NETIF_F_GSO_IPXIP4 and
NETIF_F_GSO_IPXIP6. These are used to described IP in IP
tunnel and what the outer protocol is. The inner protocol
can be deduced from other GSO types (e.g. SKB_GSO_TCPV4 and
SKB_GSO_TCPV6). The GSO types of SKB_GSO_IPIP and SKB_GSO_SIT
are removed (these are both instances of SKB_GSO_IPXIP4).
SKB_GSO_IPXIP6 will be used when support for GSO with IP
encapsulation over IPv6 is added.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In several gso_segment functions there are checks of gso_type against
a seemingly arbitrary list of SKB_GSO_* flags. This seems like an
attempt to identify unsupported GSO types, but since the stack is
the one that set these GSO types in the first place this seems
unnecessary to do. If a combination isn't valid in the first
place that stack should not allow setting it.
This is a code simplication especially for add new GSO types.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
May, we managed to uncover and fix some important bugs in our
new B.A.T.M.A.N. V algorithm. These are the fixes we came up with
together with others that I collected in the past weeks:
- avoid potential crash due to NULL pointer dereference in
B.A.T.M.A.N. V routine when a neigh_ifinfo object is not found, by
Sven Eckelmann
- avoid use-after-free of skb when counting outgoing bytes, by Florian
Westphal
- fix neigh_ifinfo object reference counting imbalance when using
B.A.T.M.A.N. V, by Sven Eckelmann. Such imbalance may lead to the
impossibility of releasing the related netdev object on shutdown
- avoid invalid memory access in case of error while allocating
bcast_own_sum when a new hard-interface is added, by Sven Eckelmann
- ensure originator address is updated in OMG/ELP packet content upon
primary interface address change, by Antonio Quartulli
- fix integer overflow when computing TQ metric (B.A.T.M.A.N. IV), by
Sven Eckelmann
- avoid race condition while adding new neigh_node which would result
in having two objects mapping to the same physical neighbour, by
Linus Lüssing
- ensure originator address is initialized in ELP packet content on
secondary interfaces, by Marek Lindner
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJXO+ZsAAoJEJ4aZjxxc6bK7UEQALxcPqv2YtQhQCexHAPlmTxg
73uXuXpo0eWPxJh2FElcsHTHSt/l0/m/G8MEWCOfqI8MTeg+UTY5lXCY4iThtiY6
n08nl/CLkPNLuDGzAd7CpCM2jJ1iAzgJEVpUGQ6qoO5ekFvO09U/J/o6E8gDTX4I
yc1kR3hcJeLbF6EUXEfTX/2THdFhkYSgHgZoGyM3eu43Zhv/6v454iEpqNxA/Ubr
UngJTGSfqf20FidUt+xjbfptltTzj25Cp+VQ2DQ/LOuit0ChECjXwMfk0UkDHlKX
bYPm1tcUJ0dzbR7cILMlzzzCSzH00leUk5GYYo59rAxBQiJ2j+92c2K0dcnv8cJJ
q+GsZapyQFBIgPWS/TLtG1II3u8YWYvl+nEQXBDtPeRAjGoO4VwSbuatsf+enM8K
i0c622iEc0b3cVe5H8reUh7SZMfMT/cQkEgpX5/psf8FWvEzeDv4HI3aUfVuSpck
mEZAy59HeLWrF+KTScougv14vyoMmKfUcr5E1Oe4X1Ke9dpVvHGiI9wlqb/J0lfl
rIc/5WI3ae/QAXkcPybanPEO8O0pBMP+Y4oCYpWmDu1qYfwZzDaSMqqEALQVVW2I
8Ta1gbzamgZDslf7YDSWOR0iyFAFcnMhT0qU6gv2+RGbcfwD+AGRLu5ufY/th0N3
L6AuFhvT/bxW6jN+iuAS
=IACg
-----END PGP SIGNATURE-----
Merge tag 'batman-adv-fix-for-davem' of git://git.open-mesh.org/linux-merge
Antonio Quartulli says:
====================
During the Wireless Battle Mesh v9 in Porto (PT) at the beginning of
May, we managed to uncover and fix some important bugs in our
new B.A.T.M.A.N. V algorithm. These are the fixes we came up with
together with others that I collected in the past weeks:
- avoid potential crash due to NULL pointer dereference in
B.A.T.M.A.N. V routine when a neigh_ifinfo object is not found, by
Sven Eckelmann
- avoid use-after-free of skb when counting outgoing bytes, by Florian
Westphal
- fix neigh_ifinfo object reference counting imbalance when using
B.A.T.M.A.N. V, by Sven Eckelmann. Such imbalance may lead to the
impossibility of releasing the related netdev object on shutdown
- avoid invalid memory access in case of error while allocating
bcast_own_sum when a new hard-interface is added, by Sven Eckelmann
- ensure originator address is updated in OMG/ELP packet content upon
primary interface address change, by Antonio Quartulli
- fix integer overflow when computing TQ metric (B.A.T.M.A.N. IV), by
Sven Eckelmann
- avoid race condition while adding new neigh_node which would result
in having two objects mapping to the same physical neighbour, by
Linus Lüssing
- ensure originator address is initialized in ELP packet content on
secondary interfaces, by Marek Lindner
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP stack can now run from process context.
Use read_lock_bh(&sk->sk_callback_lock) variant to restore previous
assumption.
Fixes: 5413d1babe ("net: do not block BH while processing socket backlog")
Fixes: d41a69f1d3 ("tcp: make tcp_sendmsg() aware of socket backlog")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP stack can now run from process context.
Use read_lock_bh(&sk->sk_callback_lock) variant to restore previous
assumption.
Fixes: 5413d1babe ("net: do not block BH while processing socket backlog")
Fixes: d41a69f1d3 ("tcp: make tcp_sendmsg() aware of socket backlog")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_splice_bits() returns int, kcm_splice_read() returns ssize_t,
both are signed.
We may need another patch to make them all ssize_t, but that
deserves a separated patch.
Fixes: 91687355b9 ("kcm: Splice support")
Reported-by: David Binderman <linuxdev.baldrick@gmail.com>
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull security subsystem updates from James Morris:
"Highlights:
- A new LSM, "LoadPin", from Kees Cook is added, which allows forcing
of modules and firmware to be loaded from a specific device (this
is from ChromeOS, where the device as a whole is verified
cryptographically via dm-verity).
This is disabled by default but can be configured to be enabled by
default (don't do this if you don't know what you're doing).
- Keys: allow authentication data to be stored in an asymmetric key.
Lots of general fixes and updates.
- SELinux: add restrictions for loading of kernel modules via
finit_module(). Distinguish non-init user namespace capability
checks. Apply execstack check on thread stacks"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (48 commits)
LSM: LoadPin: provide enablement CONFIG
Yama: use atomic allocations when reporting
seccomp: Fix comment typo
ima: add support for creating files using the mknodat syscall
ima: fix ima_inode_post_setattr
vfs: forbid write access when reading a file into memory
fs: fix over-zealous use of "const"
selinux: apply execstack check on thread stacks
selinux: distinguish non-init user namespace capability checks
LSM: LoadPin for kernel file loading restrictions
fs: define a string representation of the kernel_read_file_id enumeration
Yama: consolidate error reporting
string_helpers: add kstrdup_quotable_file
string_helpers: add kstrdup_quotable_cmdline
string_helpers: add kstrdup_quotable
selinux: check ss_initialized before revalidating an inode label
selinux: delay inode label lookup as long as possible
selinux: don't revalidate an inode's label when explicitly setting it
selinux: Change bool variable name to index.
KEYS: Add KEYCTL_DH_COMPUTE command
...
This fix prevents nodes to wrongly create a 00:00:00:00:00:00 originator
which can potentially interfere with the rest of the neighbor statistics.
Fixes: d6f94d91f7 ("batman-adv: ELP - adding basic infrastructure")
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The undefined behavior sanatizer detected an signed integer overflow in a
setup with near perfect link quality
UBSAN: Undefined behaviour in net/batman-adv/bat_iv_ogm.c:1246:25
signed integer overflow:
8713350 * 255 cannot be represented in type 'int'
The problems happens because the calculation of mixed unsigned and signed
integers resulted in an integer multiplication.
batadv_ogm_packet::tq (u8 255)
* tq_own (u8 255)
* tq_asym_penalty (int 134; max 255)
* tq_iface_penalty (int 255; max 255)
The tq_iface_penalty, tq_asym_penalty and inv_asym_penalty can just be
changed to unsigned int because they are not expected to become negative.
Fixes: c039876892 ("batman-adv: add WiFi penalty")
Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
When the MAC address of the primary interface is changed,
update the originator address in the ELP and OGM skb buffers as
well in order to reflect the change.
Fixes: d6f94d91f7 ("batman-adv: ELP - adding basic infrastructure")
Reported-by: Marek Lindner <marek@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The function batadv_iv_ogm_orig_add_if allocates new buffers for bcast_own
and bcast_own_sum. It is expected that these buffers are unchanged in case
either bcast_own or bcast_own_sum couldn't be resized.
But the error handling of this function frees the already resized buffer
for bcast_own when the allocation of the new bcast_own_sum buffer failed.
This will lead to an invalid memory access when some code will try to
access bcast_own.
Instead the resized new bcast_own buffer has to be kept. This will not lead
to problems because the size of the buffer was only increased and therefore
no user of the buffer will try to access bytes outside of the new buffer.
Fixes: d0015fdd3d ("batman-adv: provide orig_node routing API")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The functions batadv_neigh_ifinfo_get increase the reference counter of the
batadv_neigh_ifinfo. These have to be reduced again when the reference is
not used anymore to correctly free the objects.
Fixes: 9786906022 ("batman-adv: B.A.T.M.A.N. V - implement neighbor comparison API calls")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
batadv_neigh_ifinfo_get can return NULL when it cannot find (even when only
temporarily) anymore the neigh_ifinfo in the list neigh->ifinfo_list. This
has to be checked to avoid kernel Oopses when the ifinfo is dereferenced.
This a situation which isn't expected but is already handled by functions
like batadv_v_neigh_cmp. The same kind of warning is therefore used before
the function returns without dereferencing the pointers.
Fixes: 9786906022 ("batman-adv: B.A.T.M.A.N. V - implement neighbor comparison API calls")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
batadv_send_skb_to_orig() calls dev_queue_xmit() so we can't use skb->len.
Fixes: 953324776d ("batman-adv: network coding - buffer unicast packets before forward")
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Pull networking updates from David Miller:
"Highlights:
1) Support SPI based w5100 devices, from Akinobu Mita.
2) Partial Segmentation Offload, from Alexander Duyck.
3) Add GMAC4 support to stmmac driver, from Alexandre TORGUE.
4) Allow cls_flower stats offload, from Amir Vadai.
5) Implement bpf blinding, from Daniel Borkmann.
6) Optimize _ASYNC_ bit twiddling on sockets, unless the socket is
actually using FASYNC these atomics are superfluous. From Eric
Dumazet.
7) Run TCP more preemptibly, also from Eric Dumazet.
8) Support LED blinking, EEPROM dumps, and rxvlan offloading in mlx5e
driver, from Gal Pressman.
9) Allow creating ppp devices via rtnetlink, from Guillaume Nault.
10) Improve BPF usage documentation, from Jesper Dangaard Brouer.
11) Support tunneling offloads in qed, from Manish Chopra.
12) aRFS offloading in mlx5e, from Maor Gottlieb.
13) Add RFS and RPS support to SCTP protocol, from Marcelo Ricardo
Leitner.
14) Add MSG_EOR support to TCP, this allows controlling packet
coalescing on application record boundaries for more accurate
socket timestamp sampling. From Martin KaFai Lau.
15) Fix alignment of 64-bit netlink attributes across the board, from
Nicolas Dichtel.
16) Per-vlan stats in bridging, from Nikolay Aleksandrov.
17) Several conversions of drivers to ethtool ksettings, from Philippe
Reynes.
18) Checksum neutral ILA in ipv6, from Tom Herbert.
19) Factorize all of the various marvell dsa drivers into one, from
Vivien Didelot
20) Add VF support to qed driver, from Yuval Mintz"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1649 commits)
Revert "phy dp83867: Fix compilation with CONFIG_OF_MDIO=m"
Revert "phy dp83867: Make rgmii parameters optional"
r8169: default to 64-bit DMA on recent PCIe chips
phy dp83867: Make rgmii parameters optional
phy dp83867: Fix compilation with CONFIG_OF_MDIO=m
bpf: arm64: remove callee-save registers use for tmp registers
asix: Fix offset calculation in asix_rx_fixup() causing slow transmissions
switchdev: pass pointer to fib_info instead of copy
net_sched: close another race condition in tcf_mirred_release()
tipc: fix nametable publication field in nl compat
drivers: net: Don't print unpopulated net_device name
qed: add support for dcbx.
ravb: Add missing free_irq() calls to ravb_close()
qed: Remove a stray tab
net: ethernet: fec-mpc52xx: use phy_ethtool_{get|set}_link_ksettings
net: ethernet: fec-mpc52xx: use phydev from struct net_device
bpf, doc: fix typo on bpf_asm descriptions
stmmac: hardware TX COE doesn't work when force_thresh_dma_mode is set
net: ethernet: fs-enet: use phy_ethtool_{get|set}_link_ksettings
net: ethernet: fs-enet: use phydev from struct net_device
...
Pull 'struct path' constification update from Al Viro:
"'struct path' is passed by reference to a bunch of Linux security
methods; in theory, there's nothing to stop them from modifying the
damn thing and LSM community being what it is, sooner or later some
enterprising soul is going to decide that it's a good idea.
Let's remove the temptation and constify all of those..."
* 'work.const-path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
constify ima_d_path()
constify security_sb_pivotroot()
constify security_path_chroot()
constify security_path_{link,rename}
apparmor: remove useless checks for NULL ->mnt
constify security_path_{mkdir,mknod,symlink}
constify security_path_{unlink,rmdir}
apparmor: constify common_perm_...()
apparmor: constify aa_path_link()
apparmor: new helper - common_path_perm()
constify chmod_common/security_path_chmod
constify security_sb_mount()
constify chown_common/security_path_chown
tomoyo: constify assorted struct path *
apparmor_path_truncate(): path->mnt is never NULL
constify vfs_truncate()
constify security_path_truncate()
[apparmor] constify struct path * in a bunch of helpers
This merges the Qualcomm SOC tree with the net-next, solving the
merge conflict in the SMD API between the two.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXNk3mAAoJEAsfOT8Nma3FJkEQAI/0hXy5cmACx37crT6Zishn
+IjfnVFY0w+VWX6DdOVvAiNufMEBnvbo6jAGFxYALvZTRyt2kBxa4SgbQfgeDGp2
UIXU5vYjYamggBALcDc9b6xBxxYEdIjGpqueIrGPyLzuAHxC74Ss4a1UYE8kC8R4
lznB+BCuyv4WQL5GgWK2vTcFH6BD9Yw6FARLI2YJaR1oRrvl+3WCnPSKZUYPi31R
vxvGYV5yiZ5YyOyZX4VHpn2vk1+dD7mhzXS/X1cGdV3Ysaqdw3bXqM0C31HW29F5
qxnVNtbnoNMIpQQaFRMT00x7yx+12gCXXgb7Q1CDzvXFQXkGvQRlcHagmvi7dqI9
YrttwbmS9LgezzEuaUTNfOJpMS9x7ghSayzHEmVLn9/7IMxdmqUiu5f9uKMjQHsW
D3hWCTahVXf9Y+UWKOEtvnazyEcdQIuuBEzZQw05cl8OntGk1eWtHLikZVyW+zWO
lgWX2n84Bc3BPz6o0Nu+tW4WbTlo5irnA8YLeBRJNUFgwlwlhm0Io2Q7aVhW27jZ
rDPfil1KduOGgf0T2EMD9XdQf0tXMaJ7y7vzuGtaWy83kYaEDzYOmnLK39LQYlxm
yYoU5Yo9fsoFXTvLyD/3FBD8Wqblv5C8ObIB4Lc3uhmrLBYA2mjFAmsxqDV2Giuz
So/uZlqnJauYFnMBAHUu
=UpGH
-----END PGP SIGNATURE-----
Merge tag 'net-next-qcom-soc-4.7-2-merge' of git://github.com/andersson/kernel
Merge tag 'qcom-soc-for-4.7-2' into net-next
This merges the Qualcomm SOC tree with the net-next, solving the
merge conflict in the SMD API between the two.
Pull parallel filesystem directory handling update from Al Viro.
This is the main parallel directory work by Al that makes the vfs layer
able to do lookup and readdir in parallel within a single directory.
That's a big change, since this used to be all protected by the
directory inode mutex.
The inode mutex is replaced by an rwsem, and serialization of lookups of
a single name is done by a "in-progress" dentry marker.
The series begins with xattr cleanups, and then ends with switching
filesystems over to actually doing the readdir in parallel (switching to
the "iterate_shared()" that only takes the read lock).
A more detailed explanation of the process from Al Viro:
"The xattr work starts with some acl fixes, then switches ->getxattr to
passing inode and dentry separately. This is the point where the
things start to get tricky - that got merged into the very beginning
of the -rc3-based #work.lookups, to allow untangling the
security_d_instantiate() mess. The xattr work itself proceeds to
switch a lot of filesystems to generic_...xattr(); no complications
there.
After that initial xattr work, the series then does the following:
- untangle security_d_instantiate()
- convert a bunch of open-coded lookup_one_len_unlocked() to calls of
that thing; one such place (in overlayfs) actually yields a trivial
conflict with overlayfs fixes later in the cycle - overlayfs ended
up switching to a variant of lookup_one_len_unlocked() sans the
permission checks. I would've dropped that commit (it gets
overridden on merge from #ovl-fixes in #for-next; proper resolution
is to use the variant in mainline fs/overlayfs/super.c), but I
didn't want to rebase the damn thing - it was fairly late in the
cycle...
- some filesystems had managed to depend on lookup/lookup exclusion
for *fs-internal* data structures in a way that would break if we
relaxed the VFS exclusion. Fixing hadn't been hard, fortunately.
- core of that series - parallel lookup machinery, replacing
->i_mutex with rwsem, making lookup_slow() take it only shared. At
that point lookups happen in parallel; lookups on the same name
wait for the in-progress one to be done with that dentry.
Surprisingly little code, at that - almost all of it is in
fs/dcache.c, with fs/namei.c changes limited to lookup_slow() -
making it use the new primitive and actually switching to locking
shared.
- parallel readdir stuff - first of all, we provide the exclusion on
per-struct file basis, same as we do for read() vs lseek() for
regular files. That takes care of most of the needed exclusion in
readdir/readdir; however, these guys are trickier than lookups, so
I went for switching them one-by-one. To do that, a new method
'->iterate_shared()' is added and filesystems are switched to it
as they are either confirmed to be OK with shared lock on directory
or fixed to be OK with that. I hope to kill the original method
come next cycle (almost all in-tree filesystems are switched
already), but it's still not quite finished.
- several filesystems get switched to parallel readdir. The
interesting part here is dealing with dcache preseeding by readdir;
that needs minor adjustment to be safe with directory locked only
shared.
Most of the filesystems doing that got switched to in those
commits. Important exception: NFS. Turns out that NFS folks, with
their, er, insistence on VFS getting the fuck out of the way of the
Smart Filesystem Code That Knows How And What To Lock(tm) have
grown the locking of their own. They had their own homegrown
rwsem, with lookup/readdir/atomic_open being *writers* (sillyunlink
is the reader there). Of course, with VFS getting the fuck out of
the way, as requested, the actual smarts of the smart filesystem
code etc. had become exposed...
- do_last/lookup_open/atomic_open cleanups. As the result, open()
without O_CREAT locks the directory only shared. Including the
->atomic_open() case. Backmerge from #for-linus in the middle of
that - atomic_open() fix got brought in.
- then comes NFS switch to saner (VFS-based ;-) locking, killing the
homegrown "lookup and readdir are writers" kinda-sorta rwsem. All
exclusion for sillyunlink/lookup is done by the parallel lookups
mechanism. Exclusion between sillyunlink and rmdir is a real rwsem
now - rmdir being the writer.
Result: NFS lookups/readdirs/O_CREAT-less opens happen in parallel
now.
- the rest of the series consists of switching a lot of filesystems
to parallel readdir; in a lot of cases ->llseek() gets simplified
as well. One backmerge in there (again, #for-linus - rockridge
fix)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (74 commits)
ext4: switch to ->iterate_shared()
hfs: switch to ->iterate_shared()
hfsplus: switch to ->iterate_shared()
hostfs: switch to ->iterate_shared()
hpfs: switch to ->iterate_shared()
hpfs: handle allocation failures in hpfs_add_pos()
gfs2: switch to ->iterate_shared()
f2fs: switch to ->iterate_shared()
afs: switch to ->iterate_shared()
befs: switch to ->iterate_shared()
befs: constify stuff a bit
isofs: switch to ->iterate_shared()
get_acorn_filename(): deobfuscate a bit
btrfs: switch to ->iterate_shared()
logfs: no need to lock directory in lseek
switch ecryptfs to ->iterate_shared
9p: switch to ->iterate_shared()
fat: switch to ->iterate_shared()
romfs, squashfs: switch to ->iterate_shared()
more trivial ->iterate_shared conversions
...
The problem is that fib_info->nh is [0] so the struct fib_info
allocation size depends on number of nexthops. If we just copy fib_info,
we do not copy the nexthops info and driver accesses memory which is not
ours.
Given the fact that fib4 does not defer operations and therefore it does
not need copy, just pass the pointer down to drivers as it was done
before.
Fixes: 850d0cbc91 ("switchdev: remove pointers from switchdev objects")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We saw the following extra refcount release on veth device:
kernel: [7957821.463992] unregister_netdevice: waiting for mesos50284 to become free. Usage count = -1
Since we heavily use mirred action to redirect packets to veth, I think
this is caused by the following race condition:
CPU0:
tcf_mirred_release(): (in RCU callback)
struct net_device *dev = rcu_dereference_protected(m->tcfm_dev, 1);
CPU1:
mirred_device_event():
spin_lock_bh(&mirred_list_lock);
list_for_each_entry(m, &mirred_list, tcfm_list) {
if (rcu_access_pointer(m->tcfm_dev) == dev) {
dev_put(dev);
/* Note : no rcu grace period necessary, as
* net_device are already rcu protected.
*/
RCU_INIT_POINTER(m->tcfm_dev, NULL);
}
}
spin_unlock_bh(&mirred_list_lock);
CPU0:
tcf_mirred_release():
spin_lock_bh(&mirred_list_lock);
list_del(&m->tcfm_list);
spin_unlock_bh(&mirred_list_lock);
if (dev) // <======== Stil refers to the old m->tcfm_dev
dev_put(dev); // <======== dev_put() is called on it again
The action init code path is good because it is impossible to modify
an action that is being removed.
So, fix this by moving everything under the spinlock.
Fixes: 2ee22a90c7 ("net_sched: act_mirred: remove spinlock in fast path")
Fixes: 6bd00b8506 ("act_mirred: fix a race condition on mirred_list")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The publication field of the old netlink API should contain the
publication key and not the publication reference.
Fixes: 44a8ae94fd (tipc: convert legacy nl name table dump to nl compat)
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Backmerge to resolve a conflict in ovl_lookup_real();
"ovl_lookup_real(): use lookup_one_len_unlocked()" instead,
but it was too late in the cycle to rebase.
When we free cb->skb after a dump, we do it after releasing the
lock. This means that a new dump could have started in the time
being and we'll end up freeing their skb instead of ours.
This patch saves the skb and module before we unlock so we free
the right memory.
Fixes: 16b304f340 ("netlink: Eliminate kmalloc in netlink dump operation.")
Reported-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make sure the socket for which the user is listing publication exists
before parsing the socket netlink attributes.
Prior to this patch a call without any socket caused a NULL pointer
dereference in tipc_nl_publ_dump().
Tested-and-reported-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.cm>
Signed-off-by: David S. Miller <davem@davemloft.net>
memory_usage must be decreased in dequeue_func(), not in
fq_codel_dequeue(), otherwise packets dropped by Codel algo
are missing this decrease.
Also we need to clear memory_usage in fq_codel_reset()
Fixes: 95b58430ab ("fq_codel: add memory limitation per queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Follow-up for 8a3a4c6e7b ("net: make sch_handle_ingress() drop
monitor ready") to also make the egress side drop monitor ready.
Also here only TC_ACT_SHOT is a clear indication that something
went wrong. Hence don't provide false positives to drop monitors
such as 'perf record -e skb:kfree_skb ...'.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function setup_timer combines the initialization of a timer with the
initialization of the timer's function and data fields. The mulitiline
code for timer initialization is now replaced with function setup_timer.
Also, quoting the mod_timer() function comment:
-> mod_timer() is a more efficient way to update the expire field of an
active timer (if the timer is inactive it will be activated).
Use setup_timer() and mod_timer() to setup and arm a timer, making the
code compact and aid readablity.
Signed-off-by: Muhammad Falak R Wani <falakreyaz@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2016-05-14
Here are two more Bluetooth patches for the 4.7 kernel which we wanted
to get into net-next before the merge window opens. Please let me know
if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This work adds a generic facility for use from eBPF JIT compilers
that allows for further hardening of JIT generated images through
blinding constants. In response to the original work on BPF JIT
spraying published by Keegan McAllister [1], most BPF JITs were
changed to make images read-only and start at a randomized offset
in the page, where the rest was filled with trap instructions. We
have this nowadays in x86, arm, arm64 and s390 JIT compilers.
Additionally, later work also made eBPF interpreter images read
only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
arm, arm64 and s390 archs as well currently. This is done by
default for mentioned JITs when JITing is enabled. Furthermore,
we had a generic and configurable constant blinding facility on our
todo for quite some time now to further make spraying harder, and
first implementation since around netconf 2016.
We found that for systems where untrusted users can load cBPF/eBPF
code where JIT is enabled, start offset randomization helps a bit
to make jumps into crafted payload harder, but in case where larger
programs that cross page boundary are injected, we again have some
part of the program opcodes at a page start offset. With improved
guessing and more reliable payload injection, chances can increase
to jump into such payload. Elena Reshetova recently wrote a test
case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
can leave some more room for payloads. Note that for all this,
additional bugs in the kernel are still required to make the jump
(and of course to guess right, to not jump into a trap) and naturally
the JIT must be enabled, which is disabled by default.
For helping mitigation, the general idea is to provide an option
bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
that for cases where JIT should be enabled for performance reasons,
the generated image can be further hardened with blinding constants
for unpriviledged users (bpf_jit_harden == 1), with trading off
performance for these, but not for privileged ones. We also added
the option of blinding for all users (bpf_jit_harden == 2), which
is quite helpful for testing f.e. with test_bpf.ko. There are no
further e.g. hardening levels of bpf_jit_harden switch intended,
rationale is to have it dead simple to use as on/off. Since this
functionality would need to be duplicated over and over for JIT
compilers to use, which are already complex enough, we provide a
generic eBPF byte-code level based blinding implementation, which is
then just transparently JITed. JIT compilers need to make only a few
changes to integrate this facility and can be migrated one by one.
This option is for eBPF JITs and will be used in x86, arm64, s390
without too much effort, and soon ppc64 JITs, thus that native eBPF
can be blinded as well as cBPF to eBPF migrations, so that both can
be covered with a single implementation. The rule for JITs is that
bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
and in case blinding is disabled, we follow normally with JITing the
passed program. In case blinding is enabled and we fail during the
process of blinding itself, we must return with the interpreter.
Similarly, in case the JITing process after the blinding failed, we
return normally to the interpreter with the non-blinded code. Meaning,
interpreter doesn't change in any way and operates on eBPF code as
usual. For doing this pre-JIT blinding step, we need to make use of
a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
to the JIT and not in any way part of the eBPF architecture. Just like
in the same way as JITs internally make use of some helper registers
when emitting code, only that here the helper register is one
abstraction level higher in eBPF bytecode, but nevertheless in JIT
phase. That helper register is needed since f.e. manually written
program can issue loads to all registers of eBPF architecture.
The core concept with the additional register is: blind out all 32
and 64 bit constants by converting BPF_K based instructions into a
small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
and REG <OP> BPF_REG_AX, so actual operation on the target register
is translated from BPF_K into BPF_X one that is operating on
BPF_REG_AX's content. During rewriting phase when blinding, RND is
newly generated via prandom_u32() for each processed instruction.
64 bit loads are split into two 32 bit loads to make translation and
patching not too complex. Only basic thing required by JITs is to
call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
pair, and to map BPF_REG_AX into an unused register.
Small bpf_jit_disasm extract from [2] when applied to x86 JIT:
echo 0 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f5e9 + <x>:
[...]
39: mov $0xa8909090,%eax
3e: mov $0xa8909090,%eax
43: mov $0xa8ff3148,%eax
48: mov $0xa89081b4,%eax
4d: mov $0xa8900bb0,%eax
52: mov $0xa810e0c1,%eax
57: mov $0xa8908eb4,%eax
5c: mov $0xa89020b0,%eax
[...]
echo 1 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f1e5 + <x>:
[...]
39: mov $0xe1192563,%r10d
3f: xor $0x4989b5f3,%r10d
46: mov %r10d,%eax
49: mov $0xb8296d93,%r10d
4f: xor $0x10b9fd03,%r10d
56: mov %r10d,%eax
59: mov $0x8c381146,%r10d
5f: xor $0x24c7200e,%r10d
66: mov %r10d,%eax
69: mov $0xeb2a830e,%r10d
6f: xor $0x43ba02ba,%r10d
76: mov %r10d,%eax
79: mov $0xd9730af,%r10d
7f: xor $0xa5073b1f,%r10d
86: mov %r10d,%eax
89: mov $0x9a45662b,%r10d
8f: xor $0x325586ea,%r10d
96: mov %r10d,%eax
[...]
As can be seen, original constants that carry payload are hidden
when enabled, actual operations are transformed from constant-based
to register-based ones, making jumps into constants ineffective.
Above extract/example uses single BPF load instruction over and
over, but of course all instructions with constants are blinded.
Performance wise, JIT with blinding performs a bit slower than just
JIT and faster than interpreter case. This is expected, since we
still get all the performance benefits from JITing and in normal
use-cases not every single instruction needs to be blinded. Summing
up all 296 test cases averaged over multiple runs from test_bpf.ko
suite, interpreter was 55% slower than JIT only and JIT with blinding
was 8% slower than JIT only. Since there are also some extremes in
the test suite, I expect for ordinary workloads that the performance
for the JIT with blinding case is even closer to JIT only case,
f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
35 (+ blinding), and 151 (interpreter).
BPF test suite, seccomp test suite, eBPF sample code and various
bigger networking eBPF programs have been tested with this and were
running fine. For testing purposes, I also adapted interpreter and
redirected blinded eBPF image to interpreter and also here all tests
pass.
[1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
[2] https://github.com/01org/jit-spray-poc-for-ksp/
[3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Elena Reshetova <elena.reshetova@intel.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the blinding is strictly only called from inside eBPF JITs,
we need to change signatures for bpf_int_jit_compile() and
bpf_prog_select_runtime() first in order to prepare that the
eBPF program we're dealing with can change underneath. Hence,
for call sites, we need to return the latest prog. No functional
change in this patch.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Split the HAVE_BPF_JIT into two for distinguishing cBPF and eBPF JITs.
Current cBPF ones:
# git grep -n HAVE_CBPF_JIT arch/
arch/arm/Kconfig:44: select HAVE_CBPF_JIT
arch/mips/Kconfig:18: select HAVE_CBPF_JIT if !CPU_MICROMIPS
arch/powerpc/Kconfig:129: select HAVE_CBPF_JIT
arch/sparc/Kconfig:35: select HAVE_CBPF_JIT
Current eBPF ones:
# git grep -n HAVE_EBPF_JIT arch/
arch/arm64/Kconfig:61: select HAVE_EBPF_JIT
arch/s390/Kconfig:126: select HAVE_EBPF_JIT if PACK_STACK && HAVE_MARCH_Z196_FEATURES
arch/x86/Kconfig:94: select HAVE_EBPF_JIT if X86_64
Later code also needs this facility to check for eBPF JITs.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Besides others, remove redundant comments where the code is self
documenting enough, and properly indent various bpf_verifier_ops
and bpf_prog_type_list declarations. Moreover, remove two exports
that actually have no module user.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_hdr() is slightly more expensive than using skb->data in contexts
where we know they point to the same byte.
In receive path, tcp_v4_rcv() and tcp_v6_rcv() are in this situation,
as tcp header has not been pulled yet.
In output path, the same can be said when we just pushed the tcp header
in the skb, in tcp_transmit_skb() and tcp_make_synack()
Also factorize the two checks for tcb->tcp_flags & TCPHDR_SYN in
tcp_transmit_skb() and pass tcp header pointer to tcp_ecn_send(),
so that compiler can further optimize and avoid a reload.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__sock_cmsg_send() might return different error codes, not only -EINVAL.
Fixes: 24025c465f ("ipv4: process socket-level control messages in IPv4")
Fixes: ad1e46a837 ("ipv6: process socket-level control messages in IPv6")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Having multiple loadable modules with the same name cannot work
with modprobe, and having both net/qrtr/smd.ko and drivers/soc/qcom/smd.ko
results in a (somewhat cryptic) build error:
ERROR: "qcom_smd_driver_unregister" [net/qrtr/smd.ko] undefined!
ERROR: "qcom_smd_driver_register" [net/qrtr/smd.ko] undefined!
ERROR: "qcom_smd_set_drvdata" [net/qrtr/smd.ko] undefined!
ERROR: "qcom_smd_send" [net/qrtr/smd.ko] undefined!
ERROR: "qcom_smd_get_drvdata" [net/qrtr/smd.ko] undefined!
ERROR: "qcom_smd_driver_unregister" [drivers/soc/qcom/wcnss_ctrl.ko] undefined!
ERROR: "qcom_smd_driver_register" [drivers/soc/qcom/wcnss_ctrl.ko] undefined!
ERROR: "qcom_smd_set_drvdata" [drivers/soc/qcom/wcnss_ctrl.ko] undefined!
ERROR: "qcom_smd_send" [drivers/soc/qcom/wcnss_ctrl.ko] undefined!
ERROR: "qcom_smd_get_drvdata" [drivers/soc/qcom/wcnss_ctrl.ko] undefined!
Also, the qrtr driver uses the SMD interface and has a Kconfig dependency,
but also allows for compile-testing when SMD is disabled. However, if
with QCOM_SMD=m and COMPILE_TEST=y we can end up with QRTR_SMD=y and
that fails with a related link error.
The changes the dependency so we can still compile-test the driver but
not have it built-in if SMD is a module, to avoid running in the broken
configuration, and changes the Makefile to provide the driver under
a different module name.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: bdabad3e36 ("net: Add Qualcomm IPC router")
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce a new command in ndo_setup_tc() for hardware offloaded
filters, to call the NIC driver, and make it update the statistics.
This will be done before dumping the filter and its statistics.
Signed-off-by: Amir Vadai <amirva@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement the stats_update callback that will be called by NIC drivers
for hardware offloaded filters.
Signed-off-by: Amir Vadai <amirva@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On devices that support TC U32 offloads, this flag enables a filter to be
added only to HW. skip-sw and skip-hw are mutually exclusive flags. By
default without any flags, the filter is added to both HW and SW, but no
error checks are done in case of failure to add to HW. With skip-sw,
failure to add to HW is treated as an error.
Here is a sample script that adds 2 filters, one with skip-sw and the other
with skip-hw flag.
# add ingress qdisc
tc qdisc add dev p4p1 ingress
# enable hw tc offload.
ethtool -K p4p1 hw-tc-offload on
# add u32 filter with skip-sw flag.
tc filter add dev p4p1 parent ffff: protocol ip prio 99 \
handle 800:0:1 u32 ht 800: flowid 800:1 \
skip-sw \
match ip src 192.168.1.0/24 \
action drop
# add u32 filter with skip-hw flag.
tc filter add dev p4p1 parent ffff: protocol ip prio 99 \
handle 800:0:2 u32 ht 800: flowid 800:2 \
skip-hw \
match ip src 192.168.2.0/24 \
action drop
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The nf_conntrack_core.c fix in 'net' is not relevant in 'net-next'
because we no longer have a per-netns conntrack hash.
The ip_gre.c conflict as well as the iwlwifi ones were cases of
overlapping changes.
Conflicts:
drivers/net/wireless/intel/iwlwifi/mvm/tx.c
net/ipv4/ip_gre.c
net/netfilter/nf_conntrack_core.c
Signed-off-by: David S. Miller <davem@davemloft.net>
Switchdev has been around for quite a while now, putting "EXPERIMENTAL"
in the description is no longer accurate, drop it.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, when creating or updating a route, no check is performed
in both ipv4 and ipv6 code to the hoplimit value.
The caller can i.e. set hoplimit to 256, and when such route will
be used, packets will be sent with hoplimit/ttl equal to 0.
This commit adds checks for the RTAX_HOPLIMIT value, in both ipv4
ipv6 route code, substituting any value greater than 255 with 255.
This is consistent with what is currently done for ADVMSS and MTU
in the ipv4 code.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The slab name ends up being visible in the directory structure under
/sys, and even if you don't have access rights to the file you can see
the filenames.
Just use a 64-bit counter instead of the pointer to the 'net' structure
to generate a unique name.
This code will go away in 4.7 when the conntrack code moves to a single
kmemcache, but this is the backportable simple solution to avoiding
leaking kernel pointers to user space.
Fixes: 5b3501faa8 ("netfilter: nf_conntrack: per netns nf_conntrack_cachep")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
* Change SMD callback parameters
* Use writecombine mapping for SMEM
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXFvdZAAoJEFKiBbHx2RXVxPQQAJnEp1+5CDN7ziQ/zJ3juSxb
b7rOr5Z0qMDhh0H9BFdmhGzFctbY0AOqerfh138pXa6KN7maXT/enUE4vvtWUp1B
HCr9xt13waoa8e6umj/D1qU5mMLUX+JraVkIo6Dyd1g6pVmdrbwzP+Z0tW0x7CvW
n5r9eDYts66vGbQ6xzK40rTi22sEN32AY4mezKj40L8o0NvfZCCwb3dfw2RQzLHK
0KnSQ0PKqU1ZWrDc2YuvRsyh5KGv+t8vvnpMVVUi4TRx3qzoGpHpH3fE3evvMGKY
ya6nmFNFPZ+ZlzsGDx10RqLu58NmjS8LXY6O+yCGy9NPuonKpwHgYD2wxut2NR/W
TTtMhvJWKNIwhXR8ZRDaVBWQJBwoLvpNQT1zOIUuUKf+SZ2DbfgVpte7NZvrgEka
aY4nnZMbI0MvpaAHUtllWbC4xqMQJXY+4wrahKQgKF/l5H/0XQ99fHHz86DS9LCO
y+dfmU98dv/1X8qmxNfsULtJwbxi2Xwzt3MMMZb5v138STyMWeqWOLm+KGzgs1hu
Pc/RLQALdJpQEcg74s5xt2yxSqRuzDig7FtRT1a2SX33vi9YohQEgXCDe+IOJC5l
sm1Ppi3oO/yUcocUDFTBz/g6zD4SWywu2gmBcZIh/JVy8Sd053WwlaKs1Nld6ta7
QPx6rga7lgrYW409RS2A
=Jzqj
-----END PGP SIGNATURE-----
Merge tag 'qcom-soc-for-4.7-2' into net-next
This merges the Qualcomm SOC tree with the net-next, solving the
merge conflict in the SMD API between the two.
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
With all the latest fixes applied, I am still able to reproduce this
(and other) warning(s):
WARNING: CPU: 1 PID: 19684 at ../kernel/workqueue.c:4092 destroy_workqueue+0x70a/0x770()
...
Call Trace:
[<ffffffff819fee81>] ? dump_stack+0xb3/0x112
[<ffffffff8117377e>] ? warn_slowpath_common+0xde/0x140
[<ffffffff811ce68a>] ? destroy_workqueue+0x70a/0x770
[<ffffffff811739ae>] ? warn_slowpath_null+0x2e/0x40
[<ffffffff811ce68a>] ? destroy_workqueue+0x70a/0x770
[<ffffffffa0c944c9>] ? hci_unregister_dev+0x2a9/0x720 [bluetooth]
[<ffffffffa0b301db>] ? vhci_release+0x7b/0xf0 [hci_vhci]
[<ffffffffa0b30160>] ? vhci_flush+0x50/0x50 [hci_vhci]
[<ffffffff8117cd73>] ? do_exit+0x863/0x2b90
This is due to race present in the hci_unregister_dev path.
hdev->power_on work races with hci_dev_do_close. One tries to open,
the other tries to close, leading to warning like the above. (Another
example is a warning in kobject_get or kobject_put depending on who
wins the race.)
Fix this by switching those two racers to ensure hdev->power_on never
triggers while hci_dev_do_close is in progress.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
While testing an OpenStack configuration using VXLANs I saw the following
call trace:
RIP: 0010:[<ffffffff815fad49>] udp4_lib_lookup_skb+0x49/0x80
RSP: 0018:ffff88103867bc50 EFLAGS: 00010286
RAX: ffff88103269bf00 RBX: ffff88103269bf00 RCX: 00000000ffffffff
RDX: 0000000000004300 RSI: 0000000000000000 RDI: ffff880f2932e780
RBP: ffff88103867bc60 R08: 0000000000000000 R09: 000000009001a8c0
R10: 0000000000004400 R11: ffffffff81333a58 R12: ffff880f2932e794
R13: 0000000000000014 R14: 0000000000000014 R15: ffffe8efbfd89ca0
FS: 0000000000000000(0000) GS:ffff88103fd80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000488 CR3: 0000000001c06000 CR4: 00000000001426e0
Stack:
ffffffff81576515 ffffffff815733c0 ffff88103867bc98 ffffffff815fcc17
ffff88103269bf00 ffffe8efbfd89ca0 0000000000000014 0000000000000080
ffffe8efbfd89ca0 ffff88103867bcc8 ffffffff815fcf8b ffff880f2932e794
Call Trace:
[<ffffffff81576515>] ? skb_checksum+0x35/0x50
[<ffffffff815733c0>] ? skb_push+0x40/0x40
[<ffffffff815fcc17>] udp_gro_receive+0x57/0x130
[<ffffffff815fcf8b>] udp4_gro_receive+0x10b/0x2c0
[<ffffffff81605863>] inet_gro_receive+0x1d3/0x270
[<ffffffff81589e59>] dev_gro_receive+0x269/0x3b0
[<ffffffff8158a1b8>] napi_gro_receive+0x38/0x120
[<ffffffffa0871297>] gro_cell_poll+0x57/0x80 [vxlan]
[<ffffffff815899d0>] net_rx_action+0x160/0x380
[<ffffffff816965c7>] __do_softirq+0xd7/0x2c5
[<ffffffff8107d969>] run_ksoftirqd+0x29/0x50
[<ffffffff8109a50f>] smpboot_thread_fn+0x10f/0x160
[<ffffffff8109a400>] ? sort_range+0x30/0x30
[<ffffffff81096da8>] kthread+0xd8/0xf0
[<ffffffff81693c82>] ret_from_fork+0x22/0x40
[<ffffffff81096cd0>] ? kthread_park+0x60/0x60
The following trace is seen when receiving a DHCP request over a flow-based
VXLAN tunnel. I believe this is caused by the metadata dst having a NULL
dev value and as a result dev_net(dev) is causing a NULL pointer dereference.
To resolve this I am replacing the check for skb_dst(skb)->dev with just
skb->dev. This makes sense as the callers of this function are usually in
the receive path and as such skb->dev should always be populated. In
addition other functions in the area where these are called are already
using dev_net(skb->dev) to determine the namespace the UDP packet belongs
in.
Fixes: 63058308cd ("udp: Add udp6_lib_lookup_skb and udp4_lib_lookup_skb")
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sunrpc is using SOCKWQ_ASYNC_NOSPACE without setting SOCK_FASYNC,
so the recent optimizations done in sk_set_bit() and sk_clear_bit()
broke it.
There is still the risk that a subsequent sock_fasync() call
would clear SOCK_FASYNC, but sunrpc does not use this yet.
Fixes: 9317bb6982 ("net: SOCKWQ_ASYNC_NOSPACE optimizations")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jiri Pirko <jiri@resnulli.us>
Reported-by: Huang, Ying <ying.huang@intel.com>
Tested-by: Jiri Pirko <jiri@resnulli.us>
Tested-by: Huang, Ying <ying.huang@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>