/* SPDX-License-Identifier: GPL-2.0-only */
/*
 * Copyright (c) 2007-2017 Nicira, Inc.
 */

#ifndef FLOW_H
#define FLOW_H 1

#include <linux/cache.h>
#include <linux/kernel.h>
#include <linux/netlink.h>
#include <linux/openvswitch.h>
#include <linux/spinlock.h>
#include <linux/types.h>
#include <linux/rcupdate.h>
#include <linux/if_ether.h>
#include <linux/in6.h>
#include <linux/jiffies.h>
#include <linux/time.h>
#include <linux/cpumask.h>
#include <net/inet_ecn.h>
#include <net/ip_tunnels.h>
#include <net/dst_metadata.h>
#include <net/nsh.h>

struct sk_buff;

enum sw_flow_mac_proto {
	MAC_PROTO_NONE = 0,
	MAC_PROTO_ETHERNET,
};
#define SW_FLOW_KEY_INVALID	0x80

/* Store options at the end of the array if they are less than the
 * maximum size. This allows us to get the benefits of variable length
 * matching for small options.
 */
#define TUN_METADATA_OFFSET(opt_len) \
	(FIELD_SIZEOF(struct sw_flow_key, tun_opts) - opt_len)
#define TUN_METADATA_OPTS(flow_key, opt_len) \
	((void *)((flow_key)->tun_opts + TUN_METADATA_OFFSET(opt_len)))

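/*
 * Illustration only, not part of this header: a minimal userspace sketch of
 * the "store options at the end of the array" layout that the
 * TUN_METADATA_OFFSET() comment describes. struct demo_key, DEMO_OPTS_MAX
 * (assumed to be 255 here) and both helper functions are hypothetical
 * stand-ins; the real macros operate on struct sw_flow_key.
 */
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_OPTS_MAX 255	/* assumed stand-in for IP_TUNNEL_OPTS_MAX */

struct demo_key {
	unsigned char tun_opts[DEMO_OPTS_MAX];
	unsigned char tun_opts_len;
};

/* Offset at which an option block of opt_len bytes starts. */
static size_t demo_opts_offset(size_t opt_len)
{
	assert(opt_len <= DEMO_OPTS_MAX);
	return sizeof(((struct demo_key *)0)->tun_opts) - opt_len;
}

/* Zero the array, then copy the options into its tail. */
static void demo_store_opts(struct demo_key *key, const void *opts,
			    size_t opt_len)
{
	memset(key->tun_opts, 0, sizeof(key->tun_opts));
	memcpy(key->tun_opts + demo_opts_offset(opt_len), opts, opt_len);
	key->tun_opts_len = (unsigned char)opt_len;
}
```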
struct ovs_tunnel_info {
	struct metadata_dst *tun_dst;
};

struct vlan_head {
	__be16 tpid;	/* Vlan type. Generally 802.1q or 802.1ad. */
	__be16 tci;	/* 0 if no VLAN, VLAN_CFI_MASK set otherwise. */
};

#define OVS_SW_FLOW_KEY_METADATA_SIZE \
	(offsetof(struct sw_flow_key, recirc_id) + \
	FIELD_SIZEOF(struct sw_flow_key, recirc_id))

struct ovs_key_nsh {
	struct ovs_nsh_key_base base;
	__be32 context[NSH_MD1_CONTEXT_SIZE];
};

struct sw_flow_key {
	u8 tun_opts[IP_TUNNEL_OPTS_MAX];
	u8 tun_opts_len;
	struct ip_tunnel_key tun_key;	/* Encapsulating tunnel key. */
	struct {
		u32 priority;	/* Packet QoS priority. */
		u32 skb_mark;	/* SKB mark. */
		u16 in_port;	/* Input switch port (or DP_MAX_PORTS). */
	} __packed phy;	/* Safe when right after 'tun_key'. */
	u8 mac_proto;	/* MAC layer protocol (e.g. Ethernet). */
	u8 tun_proto;	/* Protocol of encapsulating tunnel. */
	u32 ovs_flow_hash;	/* Datapath computed hash value. */
	u32 recirc_id;	/* Recirculation ID. */
	struct {
		u8 src[ETH_ALEN];	/* Ethernet source address. */
		u8 dst[ETH_ALEN];	/* Ethernet destination address. */
		struct vlan_head vlan;
		struct vlan_head cvlan;
		__be16 type;	/* Ethernet frame type. */
	} eth;
	/* Filling a hole of two bytes. */
	u8 ct_state;
	u8 ct_orig_proto;	/* CT original direction tuple IP
				 * protocol.
				 */
	union {
		struct {
			__be32 top_lse;	/* top label stack entry */
		} mpls;
		struct {
			u8 proto;	/* IP protocol or lower 8 bits of ARP opcode. */
			u8 tos;		/* IP ToS. */
			u8 ttl;		/* IP TTL/hop limit. */
			u8 frag;	/* One of OVS_FRAG_TYPE_*. */
		} ip;
	};
	u16 ct_zone;	/* Conntrack zone. */
	struct {
		__be16 src;	/* TCP/UDP/SCTP source port. */
		__be16 dst;	/* TCP/UDP/SCTP destination port. */
		__be16 flags;	/* TCP flags. */
	} tp;
	union {
		struct {
			struct {
				__be32 src;	/* IP source address. */
				__be32 dst;	/* IP destination address. */
			} addr;
			union {
				struct {
					__be32 src;
					__be32 dst;
				} ct_orig;	/* Conntrack original direction fields. */
				struct {
					u8 sha[ETH_ALEN];	/* ARP source hardware address. */
					u8 tha[ETH_ALEN];	/* ARP target hardware address. */
				} arp;
			};
		} ipv4;
		struct {
			struct {
				struct in6_addr src;	/* IPv6 source address. */
				struct in6_addr dst;	/* IPv6 destination address. */
			} addr;
			__be32 label;	/* IPv6 flow label. */
			union {
				struct {
					struct in6_addr src;
					struct in6_addr dst;
				} ct_orig;	/* Conntrack original direction fields. */
				struct {
					struct in6_addr target;	/* ND target address. */
					u8 sll[ETH_ALEN];	/* ND source link layer address. */
					u8 tll[ETH_ALEN];	/* ND target link layer address. */
				} nd;
			};
		} ipv6;
		struct ovs_key_nsh nsh;	/* network service header */
	};
	struct {
		/* Connection tracking fields not packed above. */
		struct {
			__be16 src;	/* CT orig tuple tp src port. */
			__be16 dst;	/* CT orig tuple tp dst port. */
		} orig_tp;
		u32 mark;
		struct ovs_key_ct_labels labels;
	} ct;

} __aligned(BITS_PER_LONG/8); /* Ensure that we can do comparisons as longs. */

static inline bool sw_flow_key_is_nd(const struct sw_flow_key *key)
{
	return key->eth.type == htons(ETH_P_IPV6) &&
		key->ip.proto == NEXTHDR_ICMP &&
		key->tp.dst == 0 &&
		(key->tp.src == htons(NDISC_NEIGHBOUR_SOLICITATION) ||
		 key->tp.src == htons(NDISC_NEIGHBOUR_ADVERTISEMENT));
}

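/*
 * Illustration only, not part of this header: a minimal userspace sketch of
 * the same neighbour discovery test as sw_flow_key_is_nd(). struct demo_key
 * and the DEMO_* constants are hypothetical stand-ins that mirror the usual
 * values (EtherType 0x86DD, next header 58, ICMPv6 types 135 and 136). It
 * assumes, as the check above implies, that tp.src carries the ICMPv6 type
 * and tp.dst the ICMPv6 code, both in network byte order.
 */
```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <arpa/inet.h>	/* htons() */

#define DEMO_ETH_P_IPV6		0x86DD
#define DEMO_NEXTHDR_ICMP	58
#define DEMO_ND_SOLICIT		135
#define DEMO_ND_ADVERT		136

struct demo_key {
	uint16_t eth_type;	/* network byte order */
	uint8_t ip_proto;
	uint16_t tp_src;	/* ICMPv6 type, network byte order */
	uint16_t tp_dst;	/* ICMPv6 code, network byte order */
};

/* True only for IPv6 ICMPv6 neighbour solicitation/advertisement, code 0. */
static bool demo_key_is_nd(const struct demo_key *key)
{
	return key->eth_type == htons(DEMO_ETH_P_IPV6) &&
	       key->ip_proto == DEMO_NEXTHDR_ICMP &&
	       key->tp_dst == 0 &&
	       (key->tp_src == htons(DEMO_ND_SOLICIT) ||
		key->tp_src == htons(DEMO_ND_ADVERT));
}
```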
struct sw_flow_key_range {
	unsigned short int start;
	unsigned short int end;
};

struct sw_flow_mask {
	int ref_count;
	struct rcu_head rcu;
	struct sw_flow_key_range range;
	struct sw_flow_key key;
};

struct sw_flow_match {
	struct sw_flow_key *key;
	struct sw_flow_key_range range;
	struct sw_flow_mask *mask;
};

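/*
 * Illustration only, not part of this header: a minimal userspace sketch of
 * how a byte range plus a long-aligned key could be used to compare two keys
 * under a mask, one long at a time. This is not claimed to be the datapath's
 * actual lookup code; struct demo_key, struct demo_range and
 * demo_masked_equal() are hypothetical, and the range is assumed to be
 * expressed in bytes rounded to multiples of sizeof(long).
 */
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_range {
	unsigned short start;	/* first byte covered by the mask */
	unsigned short end;	/* one past the last byte covered */
};

struct demo_key {
	unsigned long words[4];
};

/* Compare a and b under mask, touching only the words inside range. */
static bool demo_masked_equal(const struct demo_key *a,
			      const struct demo_key *b,
			      const struct demo_key *mask,
			      struct demo_range range)
{
	unsigned long diff = 0;
	size_t i;

	assert(range.start % sizeof(long) == 0);
	assert(range.end % sizeof(long) == 0);
	assert(range.end <= sizeof(struct demo_key));

	for (i = range.start / sizeof(long); i < range.end / sizeof(long); i++)
		diff |= (a->words[i] ^ b->words[i]) & mask->words[i];

	return diff == 0;
}
```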
#define MAX_UFID_LENGTH 16 /* 128 bits */

struct sw_flow_id {
	u32 ufid_len;
	union {
		u32 ufid[MAX_UFID_LENGTH / 4];
		struct sw_flow_key *unmasked_key;
	};
};

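/*
 * Illustration only, not part of this header: a minimal userspace sketch of
 * a length-discriminated identifier shaped like struct sw_flow_id. It
 * assumes that a non-zero ufid_len selects the ufid member and that zero
 * selects the unmasked_key pointer; that convention, struct demo_key,
 * struct demo_flow_id and both helpers are hypothetical stand-ins.
 */
```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_UFID_LENGTH 16	/* 128 bits */

struct demo_key {
	uint32_t fields[8];
};

struct demo_flow_id {
	uint32_t ufid_len;	/* assumed: 0 means "identified by key" */
	union {
		uint32_t ufid[DEMO_MAX_UFID_LENGTH / 4];
		struct demo_key *unmasked_key;
	};
};

static bool demo_id_is_ufid(const struct demo_flow_id *id)
{
	return id->ufid_len != 0;
}

/* Two ids match if they use the same form and the selected member matches. */
static bool demo_id_equal(const struct demo_flow_id *a,
			  const struct demo_flow_id *b)
{
	if (a->ufid_len != b->ufid_len)
		return false;
	if (demo_id_is_ufid(a))
		return memcmp(a->ufid, b->ufid, a->ufid_len) == 0;
	return memcmp(a->unmasked_key, b->unmasked_key,
		      sizeof(struct demo_key)) == 0;
}
```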
struct sw_flow_actions {
	struct rcu_head rcu;
	size_t orig_len;	/* From flow_cmd_new netlink actions size */
	u32 actions_len;
	struct nlattr actions[];
};

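/*
 * Illustration only, not part of this header: a minimal userspace sketch of
 * allocating a header followed by a flexible array member, the layout that
 * struct sw_flow_actions uses for 'actions[]'. struct demo_actions and
 * demo_actions_alloc() are hypothetical; the sketch uses malloc() and raw
 * bytes where kernel code would use its own allocator and struct nlattr.
 */
```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct demo_actions {
	size_t orig_len;		/* size as originally supplied */
	uint32_t actions_len;		/* bytes currently stored */
	unsigned char actions[];	/* flexible array member */
};

/* Allocate header and payload in one block and copy len bytes from src. */
static struct demo_actions *demo_actions_alloc(const void *src, uint32_t len)
{
	struct demo_actions *acts = malloc(sizeof(*acts) + len);

	if (!acts)
		return NULL;
	acts->orig_len = len;
	acts->actions_len = len;
	memcpy(acts->actions, src, len);
	return acts;
}
```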
struct sw_flow_stats {
	u64 packet_count;	/* Number of packets matched. */
	u64 byte_count;		/* Number of bytes matched. */
	unsigned long used;	/* Last used time (in jiffies). */
	spinlock_t lock;	/* Lock for atomic stats update. */
	__be16 tcp_flags;	/* Union of seen TCP flags. */
};

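/*
 * Illustration only, not part of this header: a minimal single-threaded
 * userspace sketch of how counters shaped like struct sw_flow_stats could be
 * updated per packet. struct demo_stats and demo_stats_update() are
 * hypothetical; the 'lock' member is left out, so real concurrent updates
 * would need the locking this sketch omits.
 */
```c
#include <assert.h>
#include <stdint.h>

struct demo_stats {
	uint64_t packet_count;	/* number of packets matched */
	uint64_t byte_count;	/* number of bytes matched */
	unsigned long used;	/* last used time, caller-defined ticks */
	uint16_t tcp_flags;	/* union of seen TCP flags */
};

/* Account one packet of len bytes seen at time now. */
static void demo_stats_update(struct demo_stats *stats, uint32_t len,
			      uint16_t tcp_flags, unsigned long now)
{
	stats->used = now;
	stats->packet_count++;
	stats->byte_count += len;
	stats->tcp_flags |= tcp_flags;	/* flags accumulate, never clear */
}
```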
2011-10-26 09:26:31 +07:00
|
|
|
struct sw_flow {
|
|
|
|
struct rcu_head rcu;
|
2015-01-22 07:42:52 +07:00
|
|
|
struct {
|
|
|
|
struct hlist_node node[2];
|
|
|
|
u32 hash;
|
|
|
|
} flow_table, ufid_table;
|
2016-09-16 05:11:53 +07:00
|
|
|
int stats_last_writer; /* CPU id of the last writer on
|
openvswitch: Per NUMA node flow stats.
Keep kernel flow stats for each NUMA node rather than each (logical)
CPU. This avoids using the per-CPU allocator and removes most of the
kernel-side OVS locking overhead otherwise on the top of perf reports
and allows OVS to scale better with higher number of threads.
With 9 handlers and 4 revalidators netperf TCP_CRR test flow setup
rate doubles on a server with two hyper-threaded physical CPUs (16
logical cores each) compared to the current OVS master. Tested with
non-trivial flow table with a TCP port match rule forcing all new
connections with unique port numbers to OVS userspace. The IP
addresses are still wildcarded, so the kernel flows are not considered
as exact match 5-tuple flows. This type of flows can be expected to
appear in large numbers as the result of more effective wildcarding
made possible by improvements in OVS userspace flow classifier.
Perf results for this test (master):
Events: 305K cycles
+ 8.43% ovs-vswitchd [kernel.kallsyms] [k] mutex_spin_on_owner
+ 5.64% ovs-vswitchd [kernel.kallsyms] [k] __ticket_spin_lock
+ 4.75% ovs-vswitchd ovs-vswitchd [.] find_match_wc
+ 3.32% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_lock
+ 2.61% ovs-vswitchd [kernel.kallsyms] [k] pcpu_alloc_area
+ 2.19% ovs-vswitchd ovs-vswitchd [.] flow_hash_in_minimask_range
+ 2.03% swapper [kernel.kallsyms] [k] intel_idle
+ 1.84% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_unlock
+ 1.64% ovs-vswitchd ovs-vswitchd [.] classifier_lookup
+ 1.58% ovs-vswitchd libc-2.15.so [.] 0x7f4e6
+ 1.07% ovs-vswitchd [kernel.kallsyms] [k] memset
+ 1.03% netperf [kernel.kallsyms] [k] __ticket_spin_lock
+ 0.92% swapper [kernel.kallsyms] [k] __ticket_spin_lock
...
And after this patch:
Events: 356K cycles
+ 6.85% ovs-vswitchd ovs-vswitchd [.] find_match_wc
+ 4.63% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_lock
+ 3.06% ovs-vswitchd [kernel.kallsyms] [k] __ticket_spin_lock
+ 2.81% ovs-vswitchd ovs-vswitchd [.] flow_hash_in_minimask_range
+ 2.51% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_unlock
+ 2.27% ovs-vswitchd ovs-vswitchd [.] classifier_lookup
+ 1.84% ovs-vswitchd libc-2.15.so [.] 0x15d30f
+ 1.74% ovs-vswitchd [kernel.kallsyms] [k] mutex_spin_on_owner
+ 1.47% swapper [kernel.kallsyms] [k] intel_idle
+ 1.34% ovs-vswitchd ovs-vswitchd [.] flow_hash_in_minimask
+ 1.33% ovs-vswitchd ovs-vswitchd [.] rule_actions_unref
+ 1.16% ovs-vswitchd ovs-vswitchd [.] hindex_node_with_hash
+ 1.16% ovs-vswitchd ovs-vswitchd [.] do_xlate_actions
+ 1.09% ovs-vswitchd ovs-vswitchd [.] ofproto_rule_ref
+ 1.01% netperf [kernel.kallsyms] [k] __ticket_spin_lock
...
There is a small increase in kernel spinlock overhead due to the same
spinlock being shared between multiple cores of the same physical CPU,
but that is barely visible in the netperf TCP_CRR test performance
(maybe ~1% performance drop, hard to tell exactly due to variance in
the test results), when testing for kernel module throughput (with no
userspace activity, handful of kernel flows).
On flow setup, a single stats instance is allocated (for the NUMA node
0). As CPUs from multiple NUMA nodes start updating stats, new
NUMA-node specific stats instances are allocated. This allocation on
the packet processing code path is made to never block or look for
emergency memory pools, minimizing the allocation latency. If the
allocation fails, the existing preallocated stats instance is used.
Also, if only CPUs from one NUMA-node are updating the preallocated
stats instance, no additional stats instances are allocated. This
eliminates the need to pre-allocate stats instances that will not be
used, also relieving the stats reader from the burden of reading stats
that are never used.
Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-03-28 02:42:54 +07:00
|
|
|
* 'stats[0]'.
|
|
|
|
*/
|
2011-10-26 09:26:31 +07:00
|
|
|
struct sw_flow_key key;
|
2015-01-22 07:42:52 +07:00
|
|
|
struct sw_flow_id id;
|
openvswitch: Optimize operations for OvS flow_stats.
When flow_free() frees a flow, it calls cpumask_next() once for every
CPU in cpu_possible_mask (e.g. 128 by default). This consumes
noticeable CPU time when flow_free() is called frequently, for example
when all packets are sent to userspace via upcall and OvS sends them
back via netlink to ovs_packet_cmd_execute(), which calls flow_free().
The test topology is shown below. VM01 sends TCP packets to VM02, and
OvS forwards the packets. During the test, perf is used to report
system performance.
VM01 --- OvS-VM --- VM02
Without this patch, perf top shows the following; flow_free()
accounts for 3.02% of CPU usage.
4.23% [kernel] [k] _raw_spin_unlock_irqrestore
3.62% [kernel] [k] __do_softirq
3.16% [kernel] [k] __memcpy
3.02% [kernel] [k] flow_free
2.42% libc-2.17.so [.] __memcpy_ssse3_back
2.18% [kernel] [k] copy_user_generic_unrolled
2.17% [kernel] [k] find_next_bit
With this patch applied, perf top shows the following; flow_free()
no longer appears in the list.
4.11% [kernel] [k] _raw_spin_unlock_irqrestore
3.79% [kernel] [k] __do_softirq
3.46% [kernel] [k] __memcpy
2.73% libc-2.17.so [.] __memcpy_ssse3_back
2.25% [kernel] [k] copy_user_generic_unrolled
1.89% libc-2.17.so [.] _int_malloc
1.53% ovs-vswitchd [.] xlate_actions
With this patch, the TCP throughput between VMs (with neither the
megaflow cache nor the microflow cache in use) rises from 1.18 Gb/s
to 1.30 Gb/s (roughly a 10% improvement).
This patch adds a struct cpumask, cpu_used_mask, which records the IDs
of the CPUs that have used the flow. Only the flow_stats of those CPUs
are then checked, so it is unnecessary to walk every possible CPU when
getting, clearing, and updating the flow_stats. Adding cpu_used_mask to
struct sw_flow doesn't increase the number of cache lines it occupies.
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
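The idea in the message above, visiting only the stats slots of CPUs that actually touched the flow, can be sketched in userspace like this. It is an illustration under stated assumptions: a single 64-bit word stands in for the kernel's struct cpumask, and every `demo_*` name and `DEMO_NR_CPUS` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NR_CPUS 64	/* hypothetical; fits one 64-bit mask word */

struct demo_cpu_stats {
	uint64_t packets;
	uint64_t bytes;
};

struct demo_masked_flow {
	uint64_t cpu_used_mask;	/* bit n set: stats[n] has been updated */
	struct demo_cpu_stats stats[DEMO_NR_CPUS];
};

/* Writer: record the CPU in the mask, then update its slot. */
void demo_stats_update(struct demo_masked_flow *flow, unsigned int cpu,
		       uint64_t len)
{
	assert(cpu < DEMO_NR_CPUS);
	flow->cpu_used_mask |= UINT64_C(1) << cpu;
	flow->stats[cpu].packets++;
	flow->stats[cpu].bytes += len;
}

/* Index of the lowest set bit; 'mask' must be non-zero. */
static unsigned int demo_first_cpu(uint64_t mask)
{
	unsigned int cpu = 0;

	while (!(mask & 1)) {
		mask >>= 1;
		cpu++;
	}
	return cpu;
}

/*
 * Reader: walk only the set bits instead of every possible CPU.
 * '*visited' reports how many slots were read.
 */
struct demo_cpu_stats demo_stats_get(const struct demo_masked_flow *flow,
				     unsigned int *visited)
{
	struct demo_cpu_stats total = { 0, 0 };
	uint64_t mask = flow->cpu_used_mask;

	*visited = 0;
	while (mask) {
		unsigned int cpu = demo_first_cpu(mask);

		total.packets += flow->stats[cpu].packets;
		total.bytes += flow->stats[cpu].bytes;
		(*visited)++;
		mask &= mask - 1;	/* clear the bit just handled */
	}
	return total;
}
```

With two CPUs recorded in the mask, the reader touches two slots rather than all 64, which is the saving the commit message describes for getting, clearing, and freeing stats.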
2017-07-18 13:28:06 +07:00
|
|
|
struct cpumask cpu_used_mask;
|
2013-08-08 10:01:00 +07:00
|
|
|
struct sw_flow_mask *mask;
|
2011-10-26 09:26:31 +07:00
|
|
|
struct sw_flow_actions __rcu *sf_acts;
|
2019-07-19 23:20:13 +07:00
|
|
|
struct sw_flow_stats __rcu *stats[]; /* One for each CPU. First one
|
2014-03-28 02:42:54 +07:00
|
|
|
* is allocated at flow creation time,
|
|
|
|
* the rest are allocated on demand
|
|
|
|
* while holding the 'stats[0].lock'.
|
|
|
|
*/
|
2011-10-26 09:26:31 +07:00
|
|
|
};
|
|
|
|
|
|
|
|
struct arp_eth_header {
	__be16 ar_hrd;		/* format of hardware address */
	__be16 ar_pro;		/* format of protocol address */
	unsigned char ar_hln;	/* length of hardware address */
	unsigned char ar_pln;	/* length of protocol address */
	__be16 ar_op;		/* ARP opcode (command) */

	/* Ethernet+IPv4 specific members. */
	unsigned char ar_sha[ETH_ALEN];	/* sender hardware address */
	unsigned char ar_sip[4];	/* sender IP address */
	unsigned char ar_tha[ETH_ALEN];	/* target hardware address */
	unsigned char ar_tip[4];	/* target IP address */
} __packed;
|
|
|
|
|
2016-11-10 22:28:18 +07:00
|
|
|
static inline u8 ovs_key_mac_proto(const struct sw_flow_key *key)
{
	return key->mac_proto & ~SW_FLOW_KEY_INVALID;
}

static inline u16 __ovs_mac_header_len(u8 mac_proto)
{
	return mac_proto == MAC_PROTO_ETHERNET ? ETH_HLEN : 0;
}

static inline u16 ovs_mac_header_len(const struct sw_flow_key *key)
{
	return __ovs_mac_header_len(ovs_key_mac_proto(key));
}
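The three helpers above strip an "invalid" flag bit from the stored MAC protocol and then map the protocol to a header length. The userspace sketch below mirrors that logic. The constant values are assumptions for illustration only (the real `SW_FLOW_KEY_INVALID`, `MAC_PROTO_ETHERNET`, and `ETH_HLEN` are defined elsewhere and are not shown in this section), and the `demo_*` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for illustration; the real definitions live elsewhere. */
#define DEMO_KEY_INVALID	0x80u	/* flag bit stored alongside the protocol */
#define DEMO_MAC_PROTO_NONE	0u
#define DEMO_MAC_PROTO_ETHERNET	1u
#define DEMO_ETH_HLEN		14u	/* length of an Ethernet header in bytes */

struct demo_key {
	uint8_t mac_proto;	/* protocol value, possibly with the flag bit set */
};

/* Return the protocol with the flag bit masked off. */
uint8_t demo_key_mac_proto(const struct demo_key *key)
{
	assert(key != NULL);
	return (uint8_t)(key->mac_proto & ~DEMO_KEY_INVALID);
}

/* Ethernet frames carry a MAC header; anything else is treated as none. */
uint16_t demo_mac_header_len(const struct demo_key *key)
{
	return demo_key_mac_proto(key) == DEMO_MAC_PROTO_ETHERNET ?
		DEMO_ETH_HLEN : 0;
}
```

Because the flag is masked before the comparison, a key marked invalid still reports the header length of its underlying protocol.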
|
|
|
|
|
2015-01-22 07:42:52 +07:00
|
|
|
static inline bool ovs_identifier_is_ufid(const struct sw_flow_id *sfid)
{
	return sfid->ufid_len;
}

static inline bool ovs_identifier_is_key(const struct sw_flow_id *sfid)
{
	return !ovs_identifier_is_ufid(sfid);
}
|
|
|
|
|
2014-05-07 06:48:38 +07:00
|
|
|
void ovs_flow_stats_update(struct sw_flow *, __be16 tcp_flags,
|
2014-11-06 21:58:52 +07:00
|
|
|
const struct sk_buff *);
|
2014-05-06 04:17:28 +07:00
|
|
|
void ovs_flow_stats_get(const struct sw_flow *, struct ovs_flow_stats *,
|
2013-10-30 07:22:21 +07:00
|
|
|
unsigned long *used, __be16 *tcp_flags);
|
2014-05-06 04:17:28 +07:00
|
|
|
void ovs_flow_stats_clear(struct sw_flow *);
|
2011-10-26 09:26:31 +07:00
|
|
|
u64 ovs_flow_used_time(unsigned long flow_jiffies);
|
|
|
|
|
2014-09-16 09:37:25 +07:00
|
|
|
int ovs_flow_key_update(struct sk_buff *skb, struct sw_flow_key *key);
|
2019-08-27 21:58:09 +07:00
|
|
|
int ovs_flow_key_update_l3l4(struct sk_buff *skb, struct sw_flow_key *key);
|
2015-07-21 15:43:54 +07:00
|
|
|
int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info,
|
2014-11-06 21:58:52 +07:00
|
|
|
struct sk_buff *skb,
|
2014-10-04 05:35:31 +07:00
|
|
|
struct sw_flow_key *key);
|
2014-09-16 09:20:31 +07:00
|
|
|
/* Extract key from packet coming from userspace. */
|
2015-08-27 01:31:52 +07:00
|
|
|
int ovs_flow_key_extract_userspace(struct net *net, const struct nlattr *attr,
|
2014-09-16 09:20:31 +07:00
|
|
|
struct sk_buff *skb,
|
2014-11-06 22:03:05 +07:00
|
|
|
struct sw_flow_key *key, bool log);
|
2013-08-08 10:01:00 +07:00
|
|
|
|
2011-10-26 09:26:31 +07:00
|
|
|
#endif /* flow.h */
|