The hardware ring buffer stores data in little-endian order, so CPU
data must be converted to little-endian.
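As a rough userspace illustration of the conversion (in the kernel this
is normally done with helpers such as cpu_to_le32(); put_le32() and
get_le32() below are hypothetical stand-ins, not driver code), a 32-bit
value can be stored in little-endian byte order regardless of the host
byte order:

```c
#include <stdint.h>

/* Hypothetical helper: store a CPU-order value as little-endian bytes. */
static void put_le32(uint8_t *dst, uint32_t val)
{
	dst[0] = (uint8_t)(val & 0xff);         /* least significant byte first */
	dst[1] = (uint8_t)((val >> 8) & 0xff);
	dst[2] = (uint8_t)((val >> 16) & 0xff);
	dst[3] = (uint8_t)((val >> 24) & 0xff); /* most significant byte last */
}

/* Hypothetical helper: read little-endian bytes back into CPU order. */
static uint32_t get_le32(const uint8_t *src)
{
	return (uint32_t)src[0] |
	       ((uint32_t)src[1] << 8) |
	       ((uint32_t)src[2] << 16) |
	       ((uint32_t)src[3] << 24);
}
```

Because the helpers work on individual bytes, the result is the same on
little-endian and big-endian CPUs.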
Signed-off-by: Qianqian Xie <xieqianqian@huawei.com>
Reviewed-by: Yisen Zhuang <yisen.zhuang@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the current scenario, when the interface is disabled we reset the XGMAC
RX/TX functionality. This operation does not affect the PHY layer/SFP,
which still appears UP to the remote end (this behaviour is unlike GMAC).
As a result, the remote end keeps sending packets, which get partly
processed by XMAC and then dropped. Because they are partly processed,
they appear as errored packets in the packet counter statistics.
This patch fixes this behaviour and adds local-fault and remote-fault
functionality, which can be used to notify the remote peer whenever
the state of the interface changes. This patch also removes the
existing hns_dsaf_xge_core_srst_by_port function, which was being used
to reset the RX/TX functionality at XGE Core.
Reported-by: Jun He <hjat2005@huawei.com>
Signed-off-by: Daode Huang <huangdaode@hisilicon.com>
Reviewed-by: Yisen Zhuang <yisen.zhuang@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes the gmac_rx_filt_pkt and gmac_rx_octets_total_filt
statistics values. The two statistics were inconsistent with the
registers: they were swapped.
Signed-off-by: Qianqian Xie <xieqianqian@huawei.com>
Signed-off-by: Jun He <hjat2005@huawei.com>
Reviewed-by: Yisen Zhuang <yisen.zhuang@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch deletes redundant macro definitions in the hns drivers
and changes the include relationships between the .h files to make
the layering clearer.
Signed-off-by: Qianqian Xie <xieqianqian@huawei.com>
Signed-off-by: Weiwei Deng <dengweiwei@huawei.com>
Reviewed-by: Yisen Zhuang <yisen.zhuang@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When auto-negotiation is off and duplex is set to half, running
"ethtool -r ethX" on a port with a PHY causes the port to stop working.
Restarting auto-negotiation should be forbidden when auto-negotiation
is off. This patch adds that check.
Reported-by: Jinchuang Tian <tianjinchuang1@huawei.com>
Signed-off-by: Daode Huang <huangdaode@hisilicon.com>
Reviewed-by: Yisen Zhuang <yisen.zhuang@huawei.com>
Reviewed-by: lipeng <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The default MAC pause time is set to 0xff, which is too short for
pausing. This patch changes it to the maximum value, 0xffff.
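To put the two values in perspective, here is a small sketch of the
arithmetic. It assumes the register value ends up unchanged in the
pause_time field of the PAUSE frame, where one unit (a pause quantum)
is 512 bit times; pause_duration_ns() is a hypothetical helper, not
part of the driver.

```c
#include <stdint.h>

#define PAUSE_QUANTUM_BITS 512u	/* one pause quantum = 512 bit times */

/*
 * Hypothetical helper: requested pause duration in nanoseconds for a
 * given pause_time value and link speed in Mb/s.
 */
static uint64_t pause_duration_ns(uint16_t pause_time, uint32_t speed_mbps)
{
	uint64_t bit_times = (uint64_t)pause_time * PAUSE_QUANTUM_BITS;

	/* One bit time lasts 1000 / speed_mbps nanoseconds. */
	return bit_times * 1000u / speed_mbps;
}
```

Under these assumptions, 0xff asks for about 130 microseconds at
1 Gb/s (about 13 microseconds at 10 Gb/s), while 0xffff asks for about
33.5 milliseconds at 1 Gb/s.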
Signed-off-by: Daode Huang <huangdaode@hisilicon.com>
Reviewed-by: Yisen Zhuang <yisen.zhuang@huawei.com>
Reviewed-by: lipeng <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If promisc mode is set while there is some traffic, the service nic will
cause the system to halt. We reserve the last 6 tcam entries for the 6
ports. If promisc mode is enabled, we configure the corresponding tcam
entry for fuzzy matching and mark it valid; otherwise we mark it invalid.
Signed-off-by: Kejian Yan <yankejian@huawei.com>
Reviewed-by: Yisen Zhuang <yisen.zhuang@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since there are not enough tcam table entries for vlan and multicast
addresses, HNSv2 needs to support fuzzy matching of TCAM tables.
To add fuzzy matching of TCAM, we add a property that masks the bits to
be fuzzy matched.
Signed-off-by: Kejian Yan <yankejian@huawei.com>
Reviewed-by: Yisen Zhuang <yisen.zhuang@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since there are not enough tcam table entries for every vlan and multicast
address, HNS needs to support fuzzy matching of TCAM tables. A property
is added to mask the bits to be fuzzy matched, so update the bindings
document accordingly.
Signed-off-by: Kejian Yan <yankejian@huawei.com>
Reviewed-by: Yisen Zhuang <yisen.zhuang@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a LOCAL_INV WR is flushed, the frmr is marked STALE, then
frwr_op_unmap_sync DMA-unmaps the frmr's SGL. These STALE frmrs
are then recovered when frwr_op_map hunts for an INVALID frmr to
use.
All other cases that need frmr recovery leave that SGL DMA-mapped.
The FRMR recovery path unconditionally DMA-unmaps the frmr's SGL.
To avoid DMA unmapping the SGL twice for flushed LOCAL_INV WRs,
alter the recovery logic (rather than the hot frwr_op_unmap_sync
path) to distinguish among these cases. This solution also takes
care of the case where multiple LOCAL_INV WRs are issued for the
same rpcrdma_req, some complete successfully, but some are flushed.
Reported-by: Vasco Steinmetz <linux@kyberraum.net>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Vasco Steinmetz <linux@kyberraum.net>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
net/netfilter/ipset/ip_set_hash_ipmac.c:70:8-9: WARNING: return of 0/1 in function 'hash_ipmac4_data_list' with return type bool
net/netfilter/ipset/ip_set_hash_ipmac.c:178:8-9: WARNING: return of 0/1 in function 'hash_ipmac6_data_list' with return type bool
Return statements in functions returning bool should use
true/false instead of 1/0.
Generated by: scripts/coccinelle/misc/boolreturn.cocci
CC: Tomasz Chilinski <tomasz.chilinski@chilan.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Use setup_timer() instead of init_timer(), being the preferred way
of setting up a timer.
Also, quoting the mod_timer() function comment:
-> mod_timer() is a more efficient way to update the expire field of an
active timer (if the timer is inactive it will be activated).
Use setup_timer() and mod_timer() to setup and arm a timer, making the
code compact and easier to read.
Signed-off-by: Muhammad Falak R Wani <falakreyaz@gmail.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
The calculation of the full allocated memory did not take
into account the size of the base hash bucket structure at some
places.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
The set full case (with net_ratelimit()-ed pr_warn()) is already
handled, simply jump there.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Before this patch, struct htype was created at the first inclusion
of ip_set_hash_gen.h and was common to both the IPv4 and IPv6
set variants.
Make struct htype per ipset family and use NLEN to make
nets array fixed size to simplify struct htype allocation.
Ported from a patch proposed by Sergey Popovich <popovich_sergei@mail.ua>.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Exit as early as possible on error and use RCU_INIT_POINTER()
as the set is not seen at creation time.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Data for hashing is required to be an array of u32. Make sure that
the element data size is always a multiple of u32.
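The constraint can be illustrated with a minimal userspace sketch. It
does not claim to show how the patch itself pads the element types; the
struct names and len_in_u32() are hypothetical. A 6-byte MAC address on
its own gives a 6-byte element, which cannot be passed to a hash that
walks u32 words, whereas forcing u32 alignment makes the compiler round
the size up:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical element holding only a MAC address: sizeof is 6. */
struct mac_elem_unpadded {
	uint8_t ether[6];
};

/* Same payload, but u32 alignment rounds the size up to a u32 multiple. */
struct mac_elem_aligned {
	_Alignas(uint32_t) uint8_t ether[6];
};

/* Fail the build if the element could not be hashed word by word. */
_Static_assert(sizeof(struct mac_elem_aligned) % sizeof(uint32_t) == 0,
	       "element size must be a multiple of u32");

/* Hypothetical helper: length in u32 words, as a word-wise hash expects. */
static size_t len_in_u32(size_t bytes)
{
	return bytes / sizeof(uint32_t);
}
```

A compile-time check of this kind catches a badly sized element type
when it is introduced instead of at hash time.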
Ported from a patch proposed by Sergey Popovich <popovich_sergei@mail.ua>.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Hash types define HOST_MASK before inclusion of ip_set_hash_gen.h
and the only place where NLEN needs to be calculated at runtime
is the *_create() method.
Ported from a patch proposed by Sergey Popovich <popovich_sergei@mail.ua>.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Remove one level of indentation by using continue while
iterating over the elements in a bucket.
Ported from a patch proposed by Sergey Popovich <popovich_sergei@mail.ua>.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Remove the redundant parameters nets_length and dsize, because
they can be derived from other parameters.
Ported from a patch proposed by Sergey Popovich <popovich_sergei@mail.ua>.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Non-static (i.e. comment) extensions were not counted in the memory
size. A new internal counter is introduced for this. In the case of
the hash types, the sizes of the arrays are counted there as well, so
that we can avoid scanning the whole set when just the header data
is requested.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
It is better to list the number of set elements for all set types, so
that the header information is uniform. Element counts are therefore
added to the bitmap and list types.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
It would be useful for userspace to query the size of an ipset hash,
however, this data is not exposed to userspace outside of counting the
number of member entries. This patch uses the attribute
IPSET_ATTR_ELEMENTS to indicate the size in the header that is
exported to userspace. This field is then printed by the userspace
tool for hashes.
Signed-off-by: Eric B Munson <emunson@akamai.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Josh Hunt <johunt@akamai.com>
Cc: netfilter-devel@vger.kernel.org
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cleanup: group ip_set_put_extensions and ip_set_get_extensions
together and add missing extern.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Hash types already have their memsize calculation code in separate
functions. Clean up and do the same for *bitmap* and *list* sets.
Ported from a patch proposed by Sergey Popovich <popovich_sergei@mail.ua>.
Suggested-by: Sergey Popovich <popovich_sergei@mail.ua>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cleanup to separate all extensions into individual files.
Ported from a patch proposed by Sergey Popovich <popovich_sergei@mail.ua>.
Suggested-by: Sergey Popovich <popovich_sergei@mail.ua>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Allocate memory with kmalloc() rather than kzalloc(): the string
is immediately initialized so it is unnecessary to zero out
the allocated memory area.
Ported from a patch proposed by Sergey Popovich <popovich_sergei@mail.ua>.
Suggested-by: Sergey Popovich <popovich_sergei@mail.ua>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Use struct ip_set_skbinfo in struct ip_set_ext instead of open
coded fields and assign structure members in get/init helpers
instead of copying members one by one. Explicitly note that
struct ip_set_skbinfo must be padded to prevent non-aligned
access in the extension blob.
Ported from a patch proposed by Sergey Popovich <popovich_sergei@mail.ua>.
Suggested-by: Sergey Popovich <popovich_sergei@mail.ua>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Mark some of the helper arguments as const.
Ported from a patch proposed by Sergey Popovich <popovich_sergei@mail.ua>.
Suggested-by: Sergey Popovich <popovich_sergei@mail.ua>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
The TIOCMIWAIT implementation would return -EINVAL if any of the three
supported signals were included in the mask.
Instead of returning an error in case TIOCM_CTS is included, simply
drop the mask check completely, which is in accordance with how other
drivers implement this ioctl.
Fixes: 5a6a62bdb9 ("cdc-acm: add TIOCMIWAIT")
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Johan Hovold <johan@kernel.org>
Acked-by: Oliver Neukum <oneukum@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
After commit 6ed46d1247 ("sock_diag: align nlattr properly when
needed"), tcp_get_info() gets 64bit aligned memory, so we can avoid
the unaligned helpers.
Suggested-by: David Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lorenzo noted an Android unit test failed due to e0d56fdd73:
"The expectation in the test was that the RST replying to a SYN sent to a
closed port should be generated with oif=0. In other words it should not
prefer the interface where the SYN came in on, but instead should follow
whatever the routing table says it should do."
Revert the change to ip_send_unicast_reply and tcp_v6_send_response such
that the oif in the flow is set to the skb_iif only if skb_iif is an L3
master.
Fixes: e0d56fdd73 ("net: l3mdev: remove redundant calls")
Reported-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge tag 'batadv-next-for-davem-20161108-v2' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
pull request for net-next: batman-adv 2016-11-08 v2
This feature and cleanup patchset includes the following changes:
- netlink and code cleanups by Sven Eckelmann (3 patches)
- Cleanup and minor fixes by Linus Luessing (3 patches)
- Speed up multicast update intervals, by Linus Luessing
- Avoid (re)broadcast in meshes for some easy cases,
by Linus Luessing
- Clean up tx return state handling, by Sven Eckelmann (6 patches)
- Fix some special mac address handling cases, by Sven Eckelmann
(3 patches)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for Cypress GX3 SuperSpeed to Gigabit Ethernet
Bridge Controller (Vendor=04b4 ProdID=3610).
Patch verified on x64 linux kernel 4.7.4, 4.8.6, 4.9-rc4 systems
with the Kensington SD4600P USB-C Universal Dock with Power,
which uses the Cypress GX3 SuperSpeed to Gigabit Ethernet Bridge
Controller.
A similar patch was signed-off and tested-by Allan Chou
<allan@asix.com.tw> on 2015-12-01.
Allan verified his similar patch on x86 Linux kernel 4.1.6 system
with Cypress GX3 SuperSpeed to Gigabit Ethernet Bridge Controller.
Tested-by: Allan Chou <allan@asix.com.tw>
Tested-by: Chris Roth <chris.roth@usask.ca>
Tested-by: Artjom Simon <artjom.simon@gmail.com>
Signed-off-by: Allan Chou <allan@asix.com.tw>
Signed-off-by: Chris Roth <chris.roth@usask.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Richard Cochran says:
====================
PHC frequency fine tuning
This series expands the PTP Hardware Clock subsystem by adding a
method that passes the frequency tuning word to the drivers
without dropping the low order bits. Keeping those bits is useful for
drivers whose frequency resolution is higher than 1 ppb.
The appended script (below) runs a simple demonstration of the
improvement. This test needs two Intel i210 PCIe cards installed in
the same PC, with their SDP0 pins connected by copper wire. Measuring
the estimated offset (from the ptp4l servo) and the true offset (from
the PPS) over one hour yields the following statistics.
| | Est. Before | Est. After | True Before | True After |
|--------+---------------+---------------+---------------+---------------|
| min | -5.200000e+01 | -1.600000e+01 | -3.100000e+01 | -1.000000e+00 |
| max | +5.700000e+01 | +2.500000e+01 | +8.500000e+01 | +4.000000e+01 |
| pk-pk: | +1.090000e+02 | +4.100000e+01 | +1.160000e+02 | +4.100000e+01 |
| mean | +6.472222e-02 | +1.277778e-02 | +2.422083e+01 | +1.826083e+01 |
| stddev | +1.158006e+01 | +4.581982e+00 | +1.207708e+01 | +4.981435e+00 |
Here the numbers are in units of nanoseconds, and the ~20 nanosecond PPS
offset is due to input/output delays on the i210's external interface
logic.
With the series applied, both the peak to peak error and the standard
deviation improve by a factor of more than two. These two graphs show
the improvement nicely.
http://linuxptp.sourceforge.net/fine-tuning/fine-est.png
http://linuxptp.sourceforge.net/fine-tuning/fine-tru.png
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The dp83640 has a frequency resolution of about 0.029 ppb.
This patch lets users of the device benefit from the
increased frequency resolution when tuning the clock.
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 82580 and related devices offer a frequency resolution of about
0.029 ppb. This patch lets users of the device benefit from the
increased frequency resolution when tuning the clock.
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The internal PTP Hardware Clock (PHC) interface limits the resolution for
frequency adjustments to one part per billion. However, some hardware
devices allow finer adjustment, and making use of the increased resolution
improves synchronization measurably on such devices.
This patch adds an alternative method that allows finer frequency tuning
by passing the scaled ppm value to PHC drivers. This value comes from
user space, and it has a resolution of about 0.015 ppb. We also deprecate
the older method, anticipating its removal once existing drivers have been
converted over.
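The loss of precision in the older method can be sketched as follows.
The scaled ppm value is parts per million with a 16-bit binary
fractional part, so one unit is 1/65536 ppm, which matches the
resolution of about 0.015 ppb quoted above. The function names below
are illustrative only and are not the interface added by this patch.

```c
#include <stdint.h>

/*
 * Illustrative conversion to whole parts per billion:
 * ppb = scaled_ppm * 1000 / 65536 = scaled_ppm * 125 / 8192.
 * Everything below 1 ppb is discarded by the integer division.
 */
static int64_t demo_scaled_ppm_to_ppb(int64_t scaled_ppm)
{
	return scaled_ppm * 125 / 8192;
}

/*
 * Same value in parts per trillion, to show what the ppb form drops:
 * ppt = scaled_ppm * 1000000 / 65536 = scaled_ppm * 15625 / 1024.
 */
static int64_t demo_scaled_ppm_to_ppt(int64_t scaled_ppm)
{
	return scaled_ppm * 15625 / 1024;
}
```

With these helpers, the smallest nonzero tuning word (1) becomes 0 ppb
but 15 ppt, so a driver that only sees the ppb value cannot act on it,
while a driver that receives the scaled ppm value can.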
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Suggested-by: Ulrik De Bie <ulrik.debie-os@e2big.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are no more users outside net/core/dev.c, so
napi_hash_add() can now be static.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is automatically done from netif_napi_add(), and we want to not
export napi_hash_add() anymore in the following patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Acked-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the unused but set variables min_set and max_set in
adjust_reg_min_max_vals to fix the following warning when building with
'W=1':
kernel/bpf/verifier.c:1483:7: warning: variable ‘min_set’ set but not used [-Wunused-but-set-variable]
There is no warning about max_set being unused, but since it is only
used in the assignment of min_set it can be removed as well.
They were introduced in commit 484611357c ("bpf: allow access into map
value arrays") but seem to have never been used.
Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The tc_act macro addressed a nonexistent field, and was not used in the
kernel source.
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Lebrun says:
====================
net: add support for IPv6 Segment Routing
v5:
- Check SRH validity when adding a new route with lwtunnels and
when setting an IPV6_RTHDR socket option.
- Check that hdr->segments_left is not out of bounds when processing
an SR-enabled packet.
- Add __ro_after_init attribute to seg6_genl_policy structure.
- Add CONFIG_IPV6_SEG6_INLINE option to enable or disable
direct header insertion.
v4:
- Change @cleanup in ipv6_srh_rcv() from int to bool
- Move checksum helper functions into header file
- Add common definition for SR TLVs
- Add comments for HMAC computation algorithm
- Use rhashtable to store HMAC infos instead of linked list
- Remove packed attribute for struct sr6_tlv_hmac
- Use dst cache only if CONFIG_DST_CACHE is enabled
v3:
- Fix compilation for CONFIG_IPV6={n,m}
v2:
- Remove packed attribute from sr6 struct and replaced unaligned
16-bit flags with two 8-bit flags.
- SR code now included by default. Option CONFIG_IPV6_SEG6_HMAC
exists for HMAC support (which requires crypto dependencies).
- Replace "hidden" calls to mutex_{un,}lock to direct calls.
- Fix reverse xmas tree coding style.
- Fix cast-from-void*'s.
- Update skb->csum to account for SR modifications.
- Add dst_cache in seg6_output.
Segment Routing (SR) is a source routing paradigm, architecturally
defined in draft-ietf-spring-segment-routing-09 [1]. The IPv6 flavor of
SR is defined in draft-ietf-6man-segment-routing-header-02 [2].
The main idea is that an SR-enabled packet contains a list of segments,
which represent mandatory waypoints. Each waypoint is called a segment
endpoint. The SR-enabled packet is routed normally (e.g. shortest path)
between the segment endpoints. A node that inserts an SRH into a packet
is called an ingress node, and a node that is the last segment endpoint
is called an egress node.
From an IPv6 viewpoint, an SR-enabled packet contains an IPv6 extension
header, which is a Routing Header type 4, defined as follows:
struct ipv6_sr_hdr {
	__u8	nexthdr;
	__u8	hdrlen;
	__u8	type;
	__u8	segments_left;
	__u8	first_segment;
	__u8	flag_1;
	__u8	flag_2;
	__u8	reserved;

	struct in6_addr segments[0];
};
The first 4 bytes of the SRH are consistent with the Routing Header
definition in RFC 2460. The type is set to `4' (SRH).
Each segment is encoded as an IPv6 address. The segments are encoded in
reverse order: segments[0] is the last segment of the path, and
segments[first_segment] is the first segment of the path.
segments[segments_left] points to the currently active segment and
segments_left is decremented at each segment endpoint.
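The reverse ordering and the role of segments_left can be sketched in
userspace as follows. This is only an illustration of the indexing
described above, not the code added by this series: struct addr6,
struct demo_srh, the fixed capacity and the exact bounds check are
assumptions made for the example.

```c
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_SEGS 4	/* arbitrary capacity for the example */

struct addr6 {		/* stand-in for struct in6_addr */
	uint8_t b[16];
};

struct demo_srh {	/* only the fields needed for the example */
	uint8_t segments_left;
	uint8_t first_segment;
	struct addr6 segments[DEMO_MAX_SEGS];
};

/*
 * Act as a segment endpoint: decrement segments_left and report the
 * new active segment, which becomes the next destination address.
 * Returns 0 on success, -1 if no segment is left or the indices are
 * out of bounds.
 */
static int demo_srh_advance(struct demo_srh *srh, struct addr6 *next_dst)
{
	if (srh->first_segment >= DEMO_MAX_SEGS)
		return -1;
	if (srh->segments_left == 0 ||
	    srh->segments_left > srh->first_segment)
		return -1;

	srh->segments_left--;
	*next_dst = srh->segments[srh->segments_left];
	return 0;
}
```

For a path A, B, C the array holds segments[0] = C, segments[1] = B and
segments[2] = A, with first_segment = 2. Starting from
segments_left = 2, the endpoint A advances the packet to B and the
endpoint B advances it to C, after which segments_left is 0.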
There are two ways for a packet to receive an SRH; we call them
encap mode and inline mode. In encap mode, the packet is encapsulated
in an outer IPv6 header that contains the SRH. The inner (original) packet
is not modified. A virtual tunnel is thus created between the ingress node
(the node that encapsulates) and the egress node (the last segment of the path).
Once an encapsulated SR packet reaches the egress node, the node decapsulates
the packet and performs a routing decision on the inner packet. This kind of
SRH insertion is intended for routers that encapsulate in-transit
packets.
The second SRH insertion method, the inline mode, acts by directly inserting
the SRH right after the IPv6 header of the original packet. For this method,
if a particular flag (SR6_FLAG_CLEANUP) is set, then the penultimate segment
endpoint must strip the SRH from the packet before forwarding it to the last
segment endpoint. This insertion method is intended for end hosts;
however, it is also used for in-transit packets by some industry actors.
Note that directly inserting extension headers may break several mechanisms
such as Path MTU Discovery, IPSec AH, etc. For this reason, this insertion
method is only available if CONFIG_IPV6_SEG6_INLINE is enabled.
Finally, the SRH may contain TLVs after the segments list. Several types of
TLVs are defined, but we currently consider only the HMAC TLV. This TLV is
an answer to the deprecation of RH0 and makes it possible to ensure the
authenticity and integrity of the SRH. The HMAC text contains the flags, the
first_segment index, the full list of segments, and the source address of the
packet. While SR is intended to be used mostly within a single administrative
domain, the HMAC TLV allows verification of SR packets coming from an
untrusted source.
This patch series implements support for the IPv6 flavor of SR and is
logically divided into the following components:
(1) Data plane support (patch 01). This patch adds a function
in net/ipv6/exthdrs.c to handle the Routing Header type 4.
It enables the kernel to act as a segment endpoint, by supporting
the following operations: decrementation of the segments_left field,
cleanup flag support (removal of the SRH if we are the penultimate
segment endpoint) and decapsulation of the inner packet as an egress
node.
(2) Control plane support (patches 02..03 and 07..09). These patches enable
the insertion of an SRH on locally emitted and/or forwarded packets, both with
encap mode and with inline mode. The SRH insertion is controlled through
the lightweight tunnels mechanism. Furthermore, patch 08 enables the
applications to insert an SRH on a per-socket basis, through the
setsockopt() system call. The mechanism to specify a per-socket
Routing Header was already defined for RH0 and no special modification
was performed on this side. However, the code to actually push the RH
onto the packets had to be adapted for the SRH specifications.
(3) HMAC support (patches 04..06). These patches add support for
HMAC TLV verification for the data plane part, and generation for
the control plane part. Two hashing algorithms are supported
(SHA-1 as legacy and SHA-256 as required by the IETF draft), but
additional algorithms can be easily supported by simply adding an
entry into an array.
[1] https://tools.ietf.org/html/draft-ietf-spring-segment-routing-09
[2] https://tools.ietf.org/html/draft-ietf-6man-segment-routing-header-02
====================
Signed-off-by: David S. Miller <davem@davemloft.net>