drivers/net/ethernet/aquantia/atlantic/aq_nic.c: 991:1:
warning: no newline at end of file
Signed-off-by: Nikita Danilov <nikita.danilov@aquantia.com>
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Not careful array dereference caused analysis tools
to think there could be memory overflow.
There was actually no corruption because the array is
two dimensional.
drivers/net/ethernet/aquantia/atlantic/aq_ethtool.c: 140
aq_ethtool_get_strings() error:
memcpy() '*aq_ethtool_stat_names' too small (32 vs 704)
Signed-off-by: Nikita Danilov <nikita.danilov@aquantia.com>
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For ip rules, we need to use 'ipproto ipv6-icmp' to match ICMPv6 headers.
But for ip -6 route, currently we only support tcp, udp and icmp.
Add ICMPv6 support so we can match ipv6-icmp rules for route lookup.
v2: As David Ahern and Sabrina Dubroca suggested, Add an argument to
rtm_getroute_parse_ip_proto() to handle ICMP/ICMPv6 with different family.
Reported-by: Jianlin Shi <jishi@redhat.com>
Fixes: eacb9384a3 ("ipv6: support sport, dport and ip_proto in RTM_GETROUTE")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The MIPS eBPF JIT calls flush_icache_range() in order to ensure the
icache observes the code that we just wrote. Unfortunately it gets the
end address calculation wrong due to some bad pointer arithmetic.
The struct jit_ctx target field is of type pointer to u32, and as such
adding one to it will increment the address being pointed to by 4 bytes.
Therefore in order to find the address of the end of the code we simply
need to add the number of 4 byte instructions emitted, but we mistakenly
add the number of instructions multiplied by 4. This results in the call
to flush_icache_range() operating on a memory region 4x larger than
intended, which is always wasteful and can cause crashes if we overrun
into an unmapped page.
Fix this by correcting the pointer arithmetic to remove the bogus
multiplication, and use braces to remove the need for a set of brackets
whilst also making it obvious that the target field is a pointer.
Signed-off-by: Paul Burton <paul.burton@mips.com>
Fixes: b6bd53f9c4 ("MIPS: Add missing file for eBPF JIT.")
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Cc: netdev@vger.kernel.org
Cc: bpf@vger.kernel.org
Cc: linux-mips@vger.kernel.org
Cc: stable@vger.kernel.org # v4.13+
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add the upcoming ConnectX-6 Dx.
In addition, add "ConnectX Family mlx5Gen Virtual Function" device ID.
Every new HCA VF will be identified with this device ID. Different VF
models will be distinguished by their revision id.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Aya Levin <ayal@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Update the predicate that determines if to duplicate rules installed on
vport reps to account also for the multipath case.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The current check only validates if both netdevs use the same ops
which means both are vf reps or both uplink reps.
Unlike the case where the two uplinks are bonded (VF LAG), under
multipath scheme the switchdev parent id is not unified between the
uplink reps (and all the associated vf reps). However, we still want
to duplicate in the driver encap flows, adjust the merged eswitch
check for that matter.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
As part of creating the tunnel headers while offloading TC encap rules,
we resolve the route and neighbour in order to get the source /
destination mac.
Since the way we offload multipath route is by having two HW rules,
one per uplink port, doing naive route lookup might get us a "wrong"
routing path which goes through the peer uplink and this will get us
eventually to create a wrong L2 header for the tunnel.
To avoid that, we use a device hint to get the correct route.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Under multipath when encap rules are duplicated to HW in the driver,
it's possible for one flow to be currently un-offloaded (e.g. lack of
next-hop route or neigh entry) while the other flow is offloaded. As
such, we move to query the counters of both flows at all times.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Under multipath it's possible for us to offload the flow only through
the e-switch for which proper route through the uplink exists.
When the port is up and the next-hop route is set again we want to
offload through it as well.
We generate SW event from the FIB event handler when multipath port
affinity changes. The tc offloads code gets this event, goes over the
flows which were marked as of having missing route and attempts to
offload them.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Under multipath offload scheme, as part of handling fib events, emit
mlx5 port affinity event on the enabled ports which will be handled by
the tc offloads code.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
In a similar manner to uplink/VF LAG, under multipath we add encap peer
rule on the second port as well.
However, unlike the LAG case, we do want to allow failure for adding
one of the rules. This happens due to using a routing hint while doing
the route lookup when one path (next hop device) is down.
Introduce a new flag to indicate that route lookup failed for encap
flow. Note that a flow may still not be offloaded to hw due to missing
neighbour, in that case, the neigh update event will take care of it.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Currently the peer flow inherits the flags from the original flow
after we've set it. At this time the flags are set according to
the flow state, e.g marked as going to slow path and such.
Even if not getting us to real bugs now, this opens the door to
get us to troubles later. Future proof the code and avoid the
inheritance, use the peer flags as were set on input when we
started adding the original flow.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
To support multipath offload we are going to track SW multipath route
and related nexthops. To do that we register to FIB notifier and handle
the route and next-hops events and reflect that as port affinity to HW.
When there is a new multipath route entry that all next-hops are the
ports of an HCA we will activate LAG in HW.
Egress wise, we use HW LAG as the means to emulate multipath on current
HW which doesn't support port selection based on xmit hash. In the
presence of multiple VFs which use multiple SQs (send queues) this
yields fairly good distribution.
HA wise, HW LAG buys us the ability for a given RQ (receive queue) to
receive traffic from both ports and for SQs to migrate xmitting over
the active port if their base port fails.
When the route entry is being updated to single path we will update
the HW port affinity to use that port only.
If a next-hop becomes dead we update the HW port affinity to the living
port.
When all next-hops are alive again we reset the affinity to default.
Due to FW/HW limitations, when a route is deleted we are not disabling
the HW LAG since doing so will not allow us to enable it again while
VFs are bounded. Typically this is just a temporary state when a
routing daemon removes dead routes and later adds them back as needed.
This patch only handles events for AF_INET.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
In order to offload ecmp-on-host scheme where next-hop routes are used,
we will make use of HW LAG. Add accessor function to let upper layers
in the driver to realize if the lag acts in multi-path mode.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Instead of using the system workqueue, allocate our own workqueue.
This workqueue will be used to handle more work in the next patch.
This patch doesn't change functionality.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The change is a refactoring step towards a multipath use case.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
This fix checkpatch check
CHECK: Avoid using bool structure members because of possible alignment
issues
Signed-off-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
EAGAIN is treated as a specific case when we consider the attachment
successful but wait for neigh event before offloading the flow.
This can result in unwanted behavior when sub calls on the offloading
path will return EAGAIN and we pass this error up.
Instead of attaching to a specific error code return a boolean value
from the attach encap operation saying if the encap is valid or not.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Remove the tunnel info argument which we can get from the other args.
Also reorder the args to have input args first and output args later.
This patch doesn't change functionality.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Function mlx5e_tx_reporter_recover_from_ctx is only used within mlx5e tx
reporter, move it to be statically declared in en/reporter_tx.c.
Fixes: de8650a820 ("net/mlx5e: Add tx reporter support")
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reported-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Jakub Kicinski says:
====================
nfp: control processor DMA support and RJ45
This series starts with adding support for reporting twisted pair
media type in ethtool.
Remaining patches add support for using DMA with the control/service
processor. Currently we always copy the command data into card's
memory. DMA support allows us to have the NSP read the data from
host memory by itself. Unfortunately, the FW loading and flashing
cannot directly map the buffers for DMA because (a) the firmware
ABI returns const buffers, and (b) the buffers may be vmalloc()ed
in many mysterious/unmappable way. So just bite the bullet -
allocate new host buffer for the command and copy.
As Dirk explains, the NSP now supports updating all FWs at once
which means the max flashing time grew significantly. He bumps
the max wait to avoid timeouts.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The management firmware now supports being passed a bundle with
multiple components to be stored in flash at once. This makes it
easier to update all components to a known state with a single
user command, however, this also has the potential to increase
the time required to perform the update significantly.
The management firmware only updates the components out of a bundle
which are outdated, however, we need to make sure we can handle
the absolute worst case where a CPLD update can take a long time
to perform.
We set a very conservative total timeout of 900s which already
adds a contingency.
Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Newer versions of NSP can access host memory. Simplest access
type requires all data to be in one contiguous area. Since we
don't have the guarantee on where callers of the NSP ABI will
allocate their buffers we allocate a bounce buffer and copy
the data in and out.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
DMA version of NSP communication is coming, move the code which
copies data into the NFP buffer into a separate function.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
NSP expresses the buffer size in MB and 4 kB blocks. For small
buffers the kB part may make a difference, so count it in.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for reporting twisted pair port type.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It has been observed that tx queue stalls while downloading
from certain web sites (example www.speedtest.net)
The cause has been tracked down to a corner case where
dma descriptors where not setup properly. And there for a tx
completion interrupt was not signaled.
This fix corrects the problem by properly marking the end of
a multi descriptor transmission.
Fixes: 23f0703c12 ("lan743x: Add main source files for new lan743x driver")
Signed-off-by: Bryan Whitehead <Bryan.Whitehead@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When debugging an issue I found implausible values in state->pause.
Reason in that state->pause isn't initialized and later only single
bits are changed. Also the struct itself isn't initialized in
phylink_resolve(). So better initialize state->pause and other
not yet initialized fields.
v2:
- use right function name in subject
v3:
- initialize additional fields
Fixes: 9525ae8395 ("phylink: add phylink infrastructure")
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recently the maximum number of queues was increased up to 8, but
NIC was not fully configured for 8 queues. In setups with more than 4 CPU
cores parts of TX traffic gets lost if the kernel routes it to queues 4th-8th.
This patch sets a tx hw traffic mode with 8 queues.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202651
Fixes: 71a963cfc5 ("net: aquantia: increase max number of hw queues")
Reported-by: Nicholas Johnson <nicholas.johnson@outlook.com.au>
Signed-off-by: Dmitry Bogdanov <dmitry.bogdanov@aquantia.com>
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current implementation for UDP GRO tests is racy: the receiver
may flush the RX queue while the sending is still transmitting and
incorrectly report RX errors, with a wrong number of packet received.
Add explicit timeouts to the receiver for both connection activation
(first packet received for UDP) and reception completion, so that
in the above critical scenario the receiver will wait for the
transfer completion.
Fixes: 3327a9c463 ("selftests: add functionals test for UDP GRO")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The comphy driver for Armada 3700 by Miquèl Raynal (which is currently
in linux-next) does not actually set comphy mode when phy_set_mode_ext
is called. The mode is set at next call of phy_power_on.
Update the driver to semantics similar to mvpp2: helper
mvneta_comphy_init sets comphy mode and powers it on.
When mode is to be changed in mvneta_mac_config, first power the comphy
off, then call mvneta_comphy_init (which sets the mode to new one).
Only do this when new mode is different from old mode.
This should also work for Armada 38x, since in that comphy driver
methods power_on and power_off are unimplemented.
Signed-off-by: Marek Behún <marek.behun@nic.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Claudiu Manoil says:
====================
enetc: Add mdio support and device tree nodes
This is the missing part to enable PCI probing of the ENETC ethernet
ports on the LS1028A SoC and external traffic on the LS1028A RDB board.
It's one of the first items on the TODO list for the recently merged
ENETC ethernet driver.
v3: Add DT bindings doc for ENETC connections
v4: none
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Define connection bindings (external PHY connections and internal links)
for the ENETC on-chip ethernet controllers.
Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Each ENETC PF has its own MDIO interface, the corresponding
MDIO registers are mapped in the ENETC's Port register block.
The current patch adds a driver for these PF level MDIO buses,
so that each PF can manage directly its own external link.
Signed-off-by: Alex Marginean <alexandru.marginean@nxp.com>
Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The LS1028A RDB board features an Atheros PHY connected over
SGMII to the ENETC PF0 (or Port0). ENETC Port1 (PF1) has no
external connection on this board, so it can be disabled for now.
Signed-off-by: Alex Marginean <alexandru.marginean@nxp.com>
Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The LS1028A SoC features a PCI Integrated Endpoint Root Complex
(IERC) defining several integrated PCI devices, including the ENETC
ethernet controller integrated endpoints (IEPs). The IERC implements
ECAM (Enhanced Configuration Access Mechanism) to provide access
to the PCIe config space of the IEPs. This means the the IEPs
(including ENETC) do not support the standard PCIe BARs, instead
the Enhanced Allocation (EA) capability structures in the ECAM space
are used to fix the base addresses in the system, and the PCI
subsystem uses these structures for device enumeration and discovery.
The "ranges" entries contain basic information from these EA capabily
structures required by the kernel for device enumeration.
The current patch also enables the first 2 ENETC PFs (Physiscal
Functions) and the associated VFs (Virtual Functions), 2 VFs for
each PF. Each of these ENETC PFs has an external ethernet port
on the LS1028A SoC.
Signed-off-by: Alex Marginean <alexandru.marginean@nxp.com>
Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In bpf/syscall.c, map_create() first set map->usercnt to 1, a file
descriptor is supposed to return to userspace. When bpf_map_new_fd()
fails, drop the refcount.
Fixes: bd5f5f4ecb ("bpf: Add BPF_MAP_GET_FD_BY_ID")
Signed-off-by: Peng Sun <sironhide0null@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Merge the ipv4 and ipv6 nat chain type. This is the last
missing piece which allows to provide inet family support
for nat in a follow patch.
The kconfig knobs for ipv4/ipv6 nat chain are removed, the
nat chain type will be built unconditionally if NFT_NAT
expression is enabled.
Before:
text data bss dec hex filename
1576 896 0 2472 9a8 nft_chain_nat_ipv4.ko
1697 896 0 2593 a21 nft_chain_nat_ipv6.ko
After:
text data bss dec hex filename
1832 896 0 2728 aa8 nft_chain_nat.ko
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The family specific masq modules are way too small to warrant
an extra module, just place all of them in nft_masq.
before:
text data bss dec hex filename
1001 832 0 1833 729 nft_masq.ko
766 896 0 1662 67e nft_masq_ipv4.ko
764 896 0 1660 67c nft_masq_ipv6.ko
after:
2010 960 0 2970 b9a nft_masq.ko
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
before:
text data bss dec hex filename
990 832 0 1822 71e nft_redir.ko
697 896 0 1593 639 nft_redir_ipv4.ko
713 896 0 1609 649 nft_redir_ipv6.ko
after:
text data bss dec hex filename
1910 960 0 2870 b36 nft_redir.ko
size is reduced, all helpers from nft_redir.ko can be made static.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use struct device_attribute instead of struct idletimer_tg_attr, and
the correct callback function type to avoid indirect call mismatches
with Control Flow Integrity checking.
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
CONNTRACK_LOCKS is divisor when computer array index, if it is power of
2, compiler will optimize modulo operation as bitwise AND, or else
modulo will lower performance.
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Check the result of dereferencing base_chain->stats, instead of result
of this_cpu_ptr with NULL.
base_chain->stats maybe be changed to NULL when a chain is updated and a
new NULL counter can be attached.
And we do not need to check returning of this_cpu_ptr since
base_chain->stats is from percpu allocator if it is non-NULL,
this_cpu_ptr returns a valid value.
And fix two sparse error by replacing rcu_access_pointer and
rcu_dereference with READ_ONCE under rcu_read_lock.
Thanks for Eric's help to finish this patch.
Fixes: 009240940e ("netfilter: nf_tables: don't assume chain stats are set when jumplabel is set")
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Followup to a173f066c7 ("netfilter: bridge: Don't sabotage nf_hook
calls from an l3mdev"). Some packets (e.g., ndisc) do not have the skb
device flipped to the l3mdev (e.g., VRF) device. Update ip_sabotage_in
to not drop packets for slave devices too. Currently, neighbor
solicitation packets for 'dev -> bridge (addr) -> vrf' setups are getting
dropped. This patch enables IPv6 communications for bridges with an
address that are enslaved to a VRF.
Fixes: 73e20b761a ("net: vrf: Add support for PREROUTING rules on vrf device")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
sctp_csum_check() is called by sctp_s/dnat_handler() where it calls
skb_make_writable() to ensure sctphdr to be linearized.
So there's no need to get sctphdr by calling skb_header_pointer()
in sctp_csum_check().
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The proto in struct xt_match and struct xt_target is u16, when
calling xt_check_target/match, their proto argument is u8,
and will cause truncation, it is harmless to ip packet, since
ip proto is u8
if a etable's match/target has proto that is u16, will cause
the check failure.
and convert be16 to short in bridge/netfilter/ebtables.c
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The metadata_dst does not initialize the dst_cache field, this causes
problems to ip_md_tunnel_xmit() since it cannot use this cache, hence,
Triggering a route lookup for every packet.
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
TCP resets cause instant transition from established to closed state
provided the reset is in-window. Endpoints that implement RFC 5961
require resets to match the next expected sequence number.
RST segments that are in-window (but that do not match RCV.NXT) are
ignored, and a "challenge ACK" is sent back.
Main problem for conntrack is that its a middlebox, i.e. whereas an end
host might have ACK'd SEQ (and would thus accept an RST with this
sequence number), conntrack might not have seen this ACK (yet).
Therefore we can't simply flag RSTs with non-exact match as invalid.
This updates RST processing as follows:
1. If the connection is in a state other than ESTABLISHED, nothing is
changed, RST is subject to normal in-window check.
2. If the RSTs sequence number either matches exactly RCV.NXT,
connection state moves to CLOSE.
3. The same applies if the RST sequence number aligns with a previous
packet in the same direction.
In all other cases, the connection remains in ESTABLISHED state.
If the normal-in-window check passes, the timeout will be lowered
to that of CLOSE.
If the peer sends a challenge ack, connection timeout will be reset.
If the challenge ACK triggers another RST (RST was valid after all),
this 2nd RST will match expected sequence and conntrack state changes to
CLOSE.
If no challenge ACK is received, the connection will time out after
CLOSE seconds (10 seconds by default), just like without this patch.
Packetdrill test case:
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0
0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 win 64240 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
// Receive a segment.
0.210 < P. 1:1001(1000) ack 1 win 46
0.210 > . 1:1(0) ack 1001
// Application writes 1000 bytes.
0.250 write(4, ..., 1000) = 1000
0.250 > P. 1:1001(1000) ack 1001
// First reset, old sequence. Conntrack (correctly) considers this
// invalid due to failed window validation (regardless of this patch).
0.260 < R 2:2(0) ack 1001 win 260
// 2nd reset, but too far ahead sequence. Same: correctly handled
// as invalid.
0.270 < R 99990001:99990001(0) ack 1001 win 260
// in-window, but not exact sequence.
// Current Linux kernels might reply with a challenge ack, and do not
// remove connection.
// Without this patch, conntrack state moves to CLOSE.
// With patch, timeout is lowered like CLOSE, but connection stays
// in ESTABLISHED state.
0.280 < R 1010:1010(0) ack 1001 win 260
// Expect challenge ACK
0.281 > . 1001:1001(0) ack 1001 win 501
// With or without this patch, RST will cause connection
// to move to CLOSE (sequence number matches)
// 0.282 < R 1001:1001(0) ack 1001 win 260
// ACK
0.300 < . 1001:1001(0) ack 1001 win 257
// more data could be exchanged here, connection
// is still established
// Client closes the connection.
0.610 < F. 1001:1001(0) ack 1001 win 260
0.650 > . 1001:1001(0) ack 1002
// Close the connection without reading outstanding data
0.700 close(4) = 0
// so one more reset. Will be deemed acceptable with patch as well:
// connection is already closing.
0.701 > R. 1001:1001(0) ack 1002 win 501
// End packetdrill test case.
With patch, this generates following conntrack events:
[NEW] 120 SYN_SENT src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [UNREPLIED]
[UPDATE] 60 SYN_RECV src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80
[UPDATE] 432000 ESTABLISHED src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [ASSURED]
[UPDATE] 120 FIN_WAIT src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [ASSURED]
[UPDATE] 60 CLOSE_WAIT src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [ASSURED]
[UPDATE] 10 CLOSE src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [ASSURED]
Without patch, first RST moves connection to close, whereas socket state
does not change until FIN is received.
[NEW] 120 SYN_SENT src=10.0.2.1 dst=10.0.0.1 sport=5141 dport=80 [UNREPLIED]
[UPDATE] 60 SYN_RECV src=10.0.2.1 dst=10.0.0.1 sport=5141 dport=80
[UPDATE] 432000 ESTABLISHED src=10.0.2.1 dst=10.0.0.1 sport=5141 dport=80 [ASSURED]
[UPDATE] 10 CLOSE src=10.0.2.1 dst=10.0.0.1 sport=5141 dport=80 [ASSURED]
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Change the data type of the following variables from int to bool
across ipvs code:
- found
- loop
- need_full_dest
- need_full_svc
- payload_csum
Also change the following functions to use bool full_entry param
instead of int:
- ip_vs_genl_parse_dest()
- ip_vs_genl_parse_service()
This patch does not change any functionality but makes the source
code slightly easier to read.
Signed-off-by: Andrea Claudi <aclaudi@redhat.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>