2006-01-03 01:04:38 +07:00
|
|
|
/*
|
2014-06-09 23:08:18 +07:00
|
|
|
* net/tipc/socket.c: TIPC socket API
|
2007-02-09 21:25:21 +07:00
|
|
|
*
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
* Copyright (c) 2001-2007, 2012-2017, Ericsson AB
|
tipc: introduce new TIPC server infrastructure
TIPC has two internal servers, one providing a subscription
service for topology events, and another providing the
configuration interface. These servers have previously been running
in BH context, accessing the TIPC-port (aka native) API directly.
Apart from these servers, even the TIPC socket implementation is
partially built on this API.
As this API may simultaneously be called via different paths and in
different contexts, a complex and costly lock policiy is required
in order to protect TIPC internal resources.
To eliminate the need for this complex lock policiy, we introduce
a new, generic service API that uses kernel sockets for message
passing instead of the native API. Once the toplogy and configuration
servers are converted to use this new service, all code pertaining
to the native API can be removed. This entails a significant
reduction in code amount and complexity, and opens up for a complete
rework of the locking policy in TIPC.
The new service also solves another problem:
As the current topology server works in BH context, it cannot easily
be blocked when sending of events fails due to congestion. In such
cases events may have to be silently dropped, something that is
unacceptable. Therefore, the new service keeps a dedicated outbound
queue receiving messages from BH context. Once messages are
inserted into this queue, we will immediately schedule a work from a
special workqueue. This way, messages/events from the topology server
are in reality sent in process context, and the server can block
if necessary.
Analogously, there is a new workqueue for receiving messages. Once a
notification about an arriving message is received in BH context, we
schedule a work from the receive workqueue to do the job of
receiving the message in process context.
As both sending and receive messages are now finished in processes,
subscribed events cannot be dropped any more.
As of this commit, this new server infrastructure is built, but
not actually yet called by the existing TIPC code, but since the
conversion changes required in order to use it are significant,
the addition is kept here as a separate commit.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-17 21:54:39 +07:00
|
|
|
* Copyright (c) 2004-2008, 2010-2013, Wind River Systems
|
2006-01-03 01:04:38 +07:00
|
|
|
* All rights reserved.
|
|
|
|
*
|
2006-01-11 19:30:43 +07:00
|
|
|
* Redistribution and use in source and binary forms, with or without
|
2006-01-03 01:04:38 +07:00
|
|
|
* modification, are permitted provided that the following conditions are met:
|
|
|
|
*
|
2006-01-11 19:30:43 +07:00
|
|
|
* 1. Redistributions of source code must retain the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer.
|
|
|
|
* 2. Redistributions in binary form must reproduce the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer in the
|
|
|
|
* documentation and/or other materials provided with the distribution.
|
|
|
|
* 3. Neither the names of the copyright holders nor the names of its
|
|
|
|
* contributors may be used to endorse or promote products derived from
|
|
|
|
* this software without specific prior written permission.
|
2006-01-03 01:04:38 +07:00
|
|
|
*
|
2006-01-11 19:30:43 +07:00
|
|
|
* Alternatively, this software may be distributed under the terms of the
|
|
|
|
* GNU General Public License ("GPL") version 2 as published by the Free
|
|
|
|
* Software Foundation.
|
|
|
|
*
|
|
|
|
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
|
|
|
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
|
|
|
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
|
|
|
* ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
|
|
|
|
* LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
|
|
|
|
* CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
|
|
|
|
* SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
|
|
|
|
* INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
|
|
|
|
* CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
|
|
|
|
* ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
|
2006-01-03 01:04:38 +07:00
|
|
|
* POSSIBILITY OF SUCH DAMAGE.
|
|
|
|
*/
|
|
|
|
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
#include <linux/rhashtable.h>
|
2017-02-03 01:15:33 +07:00
|
|
|
#include <linux/sched/signal.h>
|
|
|
|
|
2006-01-03 01:04:38 +07:00
|
|
|
#include "core.h"
|
2014-06-26 08:41:37 +07:00
|
|
|
#include "name_table.h"
|
2014-04-24 21:26:47 +07:00
|
|
|
#include "node.h"
|
2014-06-26 08:41:37 +07:00
|
|
|
#include "link.h"
|
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 20:36:41 +07:00
|
|
|
#include "name_distr.h"
|
2014-08-23 05:09:18 +07:00
|
|
|
#include "socket.h"
|
2015-05-14 21:46:13 +07:00
|
|
|
#include "bcast.h"
|
2016-03-04 23:04:42 +07:00
|
|
|
#include "netlink.h"
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
#include "group.h"
|
tipc: enable tracepoints in tipc
As for the sake of debugging/tracing, the commit enables tracepoints in
TIPC along with some general trace_events as shown below. It also
defines some 'tipc_*_dump()' functions that allow to dump TIPC object
data whenever needed, that is, for general debug purposes, ie. not just
for the trace_events.
The following trace_events are now available:
- trace_tipc_skb_dump(): allows to trace and dump TIPC msg & skb data,
e.g. message type, user, droppable, skb truesize, cloned skb, etc.
- trace_tipc_list_dump(): allows to trace and dump any TIPC buffers or
queues, e.g. TIPC link transmq, socket receive queue, etc.
- trace_tipc_sk_dump(): allows to trace and dump TIPC socket data, e.g.
sk state, sk type, connection type, rmem_alloc, socket queues, etc.
- trace_tipc_link_dump(): allows to trace and dump TIPC link data, e.g.
link state, silent_intv_cnt, gap, bc_gap, link queues, etc.
- trace_tipc_node_dump(): allows to trace and dump TIPC node data, e.g.
node state, active links, capabilities, link entries, etc.
How to use:
Put the trace functions at any places where we want to dump TIPC data
or events.
Note:
a) The dump functions will generate raw data only, that is, to offload
the trace event's processing, it can require a tool or script to parse
the data but this should be simple.
b) The trace_tipc_*_dump() should be reserved for a failure cases only
(e.g. the retransmission failure case) or where we do not expect to
happen too often, then we can consider enabling these events by default
since they will almost not take any effects under normal conditions,
but once the rare condition or failure occurs, we get the dumped data
fully for post-analysis.
For other trace purposes, we can reuse these trace classes as template
but different events.
c) A trace_event is only effective when we enable it. To enable the
TIPC trace_events, echo 1 to 'enable' files in the events/tipc/
directory in the 'debugfs' file system. Normally, they are located at:
/sys/kernel/debug/tracing/events/tipc/
For example:
To enable the tipc_link_dump event:
echo 1 > /sys/kernel/debug/tracing/events/tipc/tipc_link_dump/enable
To enable all the TIPC trace_events:
echo 1 > /sys/kernel/debug/tracing/events/tipc/enable
To collect the trace data:
cat trace
or
cat trace_pipe > /trace.out &
To disable all the TIPC trace_events:
echo 0 > /sys/kernel/debug/tracing/events/tipc/enable
To clear the trace buffer:
echo > trace
d) Like the other trace_events, the feature like 'filter' or 'trigger'
is also usable for the tipc trace_events.
For more details, have a look at:
Documentation/trace/ftrace.txt
MAINTAINERS | add two new files 'trace.h' & 'trace.c' in tipc
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 09:17:56 +07:00
|
|
|
#include "trace.h"
|
2012-06-29 11:16:37 +07:00
|
|
|
|
tipc: add test for Nagle algorithm effectiveness
When streaming in Nagle mode, we try to bundle small messages from user
as many as possible if there is one outstanding buffer, i.e. not ACK-ed
by the receiving side, which helps boost up the overall throughput. So,
the algorithm's effectiveness really depends on when Nagle ACK comes or
what the specific network latency (RTT) is, compared to the user's
message sending rate.
In a bad case, the user's sending rate is low or the network latency is
small, there will not be many bundles, so making a Nagle ACK or waiting
for it is not meaningful.
For example: a user sends its messages every 100ms and the RTT is 50ms,
then for each messages, we require one Nagle ACK but then there is only
one user message sent without any bundles.
In a better case, even if we have a few bundles (e.g. the RTT = 300ms),
but now the user sends messages in medium size, then there will not be
any difference at all, that says 3 x 1000-byte data messages if bundled
will still result in 3 bundles with MTU = 1500.
When Nagle is ineffective, the delay in user message sending is clearly
wasted instead of sending directly.
Besides, adding Nagle ACKs will consume some processor load on both the
sending and receiving sides.
This commit adds a test on the effectiveness of the Nagle algorithm for
an individual connection in the network on which it actually runs.
Particularly, upon receipt of a Nagle ACK we will compare the number of
bundles in the backlog queue to the number of user messages which would
be sent directly without Nagle. If the ratio is good (e.g. >= 2), Nagle
mode will be kept for further message sending. Otherwise, we will leave
Nagle and put a 'penalty' on the connection, so it will have to spend
more 'one-way' messages before being able to re-enter Nagle.
In addition, the 'ack-required' bit is only set when really needed that
the number of Nagle ACKs will be reduced during Nagle mode.
Testing with benchmark showed that with the patch, there was not much
difference in throughput for small messages since the tool continuously
sends messages without a break, so Nagle would still take in effect.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 16:38:38 +07:00
|
|
|
#define NAGLE_START_INIT 4
|
|
|
|
#define NAGLE_START_MAX 1024
|
2018-09-29 01:23:22 +07:00
|
|
|
#define CONN_TIMEOUT_DEFAULT 8000 /* default connect timeout = 8s */
|
2017-10-20 16:21:32 +07:00
|
|
|
#define CONN_PROBING_INTV msecs_to_jiffies(3600000) /* [ms] => 1 h */
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
#define TIPC_FWD_MSG 1
|
|
|
|
#define TIPC_MAX_PORT 0xffffffff
|
|
|
|
#define TIPC_MIN_PORT 1
|
2017-05-02 23:16:53 +07:00
|
|
|
#define TIPC_ACK_RATE 4 /* ACK at 1/4 of of rcv window size */
|
2014-08-23 05:09:20 +07:00
|
|
|
|
2016-11-01 20:02:43 +07:00
|
|
|
enum {
|
|
|
|
TIPC_LISTEN = TCP_LISTEN,
|
2016-11-01 20:02:44 +07:00
|
|
|
TIPC_ESTABLISHED = TCP_ESTABLISHED,
|
2016-11-01 20:02:45 +07:00
|
|
|
TIPC_OPEN = TCP_CLOSE,
|
2016-11-01 20:02:46 +07:00
|
|
|
TIPC_DISCONNECTING = TCP_CLOSE_WAIT,
|
2016-11-01 20:02:48 +07:00
|
|
|
TIPC_CONNECTING = TCP_SYN_SENT,
|
2016-11-01 20:02:43 +07:00
|
|
|
};
|
|
|
|
|
2017-10-13 16:04:24 +07:00
|
|
|
struct sockaddr_pair {
|
|
|
|
struct sockaddr_tipc sock;
|
|
|
|
struct sockaddr_tipc member;
|
|
|
|
};
|
|
|
|
|
2014-08-23 05:09:20 +07:00
|
|
|
/**
|
|
|
|
* struct tipc_sock - TIPC socket structure
|
|
|
|
* @sk: socket - interacts with 'port' and with user via the socket API
|
|
|
|
* @conn_type: TIPC type used when connection was established
|
|
|
|
* @conn_instance: TIPC instance used when connection was established
|
|
|
|
* @published: non-zero if port has one or more associated names
|
|
|
|
* @max_pkt: maximum packet size "hint" used when building messages sent by port
|
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 20:00:41 +07:00
|
|
|
* @maxnagle: maximum size of msg which can be subject to nagle
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
* @portid: unique port identity in TIPC socket hash table
|
2014-08-23 05:09:20 +07:00
|
|
|
* @phdr: preformatted message header used when sending messages
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
* #cong_links: list of congested links
|
2014-08-23 05:09:20 +07:00
|
|
|
* @publications: list of publications for port
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
* @blocking_link: address of the congested link we are currently sleeping on
|
2014-08-23 05:09:20 +07:00
|
|
|
* @pub_count: total # of publications port has made during its lifetime
|
|
|
|
* @conn_timeout: the time we can wait for an unresponded setup request
|
|
|
|
* @dupl_rcvcnt: number of bytes counted twice, in both backlog and rcv queue
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
* @cong_link_cnt: number of congested links
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
* @snt_unacked: # messages sent by socket, and not yet acked by peer
|
2014-08-23 05:09:20 +07:00
|
|
|
* @rcv_unacked: # messages read by user, but not yet acked back to peer
|
2016-11-01 20:02:38 +07:00
|
|
|
* @peer: 'connected' peer for dgram/rdm
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
* @node: hash table node
|
2017-01-19 01:50:53 +07:00
|
|
|
* @mc_method: cookie for use between socket and broadcast layer
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
* @rcu: rcu struct for tipc_sock
|
2014-08-23 05:09:20 +07:00
|
|
|
*/
|
|
|
|
struct tipc_sock {
|
|
|
|
struct sock sk;
|
|
|
|
u32 conn_type;
|
|
|
|
u32 conn_instance;
|
|
|
|
int published;
|
|
|
|
u32 max_pkt;
|
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 20:00:41 +07:00
|
|
|
u32 maxnagle;
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
u32 portid;
|
2014-08-23 05:09:20 +07:00
|
|
|
struct tipc_msg phdr;
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
struct list_head cong_links;
|
2014-08-23 05:09:20 +07:00
|
|
|
struct list_head publications;
|
|
|
|
u32 pub_count;
|
|
|
|
atomic_t dupl_rcvcnt;
|
2018-09-29 01:23:22 +07:00
|
|
|
u16 conn_timeout;
|
2016-11-01 20:02:44 +07:00
|
|
|
bool probe_unacked;
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
u16 cong_link_cnt;
|
tipc: redesign connection-level flow control
There are two flow control mechanisms in TIPC; one at link level that
handles network congestion, burst control, and retransmission, and one
at connection level which' only remaining task is to prevent overflow
in the receiving socket buffer. In TIPC, the latter task has to be
solved end-to-end because messages can not be thrown away once they
have been accepted and delivered upwards from the link layer, i.e, we
can never permit the receive buffer to overflow.
Currently, this algorithm is message based. A counter in the receiving
socket keeps track of number of consumed messages, and sends a dedicated
acknowledge message back to the sender for each 256 consumed message.
A counter at the sending end keeps track of the sent, not yet
acknowledged messages, and blocks the sender if this number ever reaches
512 unacknowledged messages. When the missing acknowledge arrives, the
socket is then woken up for renewed transmission. This works well for
keeping the message flow running, as it almost never happens that a
sender socket is blocked this way.
A problem with the current mechanism is that it potentially is very
memory consuming. Since we don't distinguish between small and large
messages, we have to dimension the socket receive buffer according
to a worst-case of both. I.e., the window size must be chosen large
enough to sustain a reasonable throughput even for the smallest
messages, while we must still consider a scenario where all messages
are of maximum size. Hence, the current fix window size of 512 messages
and a maximum message size of 66k results in a receive buffer of 66 MB
when truesize(66k) = 131k is taken into account. It is possible to do
much better.
This commit introduces an algorithm where we instead use 1024-byte
blocks as base unit. This unit, always rounded upwards from the
actual message size, is used when we advertise windows as well as when
we count and acknowledge transmitted data. The advertised window is
based on the configured receive buffer size in such a way that even
the worst-case truesize/msgsize ratio always is covered. Since the
smallest possible message size (from a flow control viewpoint) now is
1024 bytes, we can safely assume this ratio to be less than four, which
is the value we are now using.
This way, we have been able to reduce the default receive buffer size
from 66 MB to 2 MB with maintained performance.
In order to keep this solution backwards compatible, we introduce a
new capability bit in the discovery protocol, and use this throughout
the message sending/reception path to always select the right unit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 22:58:47 +07:00
|
|
|
u16 snt_unacked;
|
|
|
|
u16 snd_win;
|
2016-05-02 22:58:46 +07:00
|
|
|
u16 peer_caps;
|
tipc: redesign connection-level flow control
There are two flow control mechanisms in TIPC; one at link level that
handles network congestion, burst control, and retransmission, and one
at connection level which' only remaining task is to prevent overflow
in the receiving socket buffer. In TIPC, the latter task has to be
solved end-to-end because messages can not be thrown away once they
have been accepted and delivered upwards from the link layer, i.e, we
can never permit the receive buffer to overflow.
Currently, this algorithm is message based. A counter in the receiving
socket keeps track of number of consumed messages, and sends a dedicated
acknowledge message back to the sender for each 256 consumed message.
A counter at the sending end keeps track of the sent, not yet
acknowledged messages, and blocks the sender if this number ever reaches
512 unacknowledged messages. When the missing acknowledge arrives, the
socket is then woken up for renewed transmission. This works well for
keeping the message flow running, as it almost never happens that a
sender socket is blocked this way.
A problem with the current mechanism is that it potentially is very
memory consuming. Since we don't distinguish between small and large
messages, we have to dimension the socket receive buffer according
to a worst-case of both. I.e., the window size must be chosen large
enough to sustain a reasonable throughput even for the smallest
messages, while we must still consider a scenario where all messages
are of maximum size. Hence, the current fix window size of 512 messages
and a maximum message size of 66k results in a receive buffer of 66 MB
when truesize(66k) = 131k is taken into account. It is possible to do
much better.
This commit introduces an algorithm where we instead use 1024-byte
blocks as base unit. This unit, always rounded upwards from the
actual message size, is used when we advertise windows as well as when
we count and acknowledge transmitted data. The advertised window is
based on the configured receive buffer size in such a way that even
the worst-case truesize/msgsize ratio always is covered. Since the
smallest possible message size (from a flow control viewpoint) now is
1024 bytes, we can safely assume this ratio to be less than four, which
is the value we are now using.
This way, we have been able to reduce the default receive buffer size
from 66 MB to 2 MB with maintained performance.
In order to keep this solution backwards compatible, we introduce a
new capability bit in the discovery protocol, and use this throughout
the message sending/reception path to always select the right unit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 22:58:47 +07:00
|
|
|
u16 rcv_unacked;
|
|
|
|
u16 rcv_win;
|
2016-11-01 20:02:38 +07:00
|
|
|
struct sockaddr_tipc peer;
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
struct rhash_head node;
|
2017-01-19 01:50:53 +07:00
|
|
|
struct tipc_mc_method mc_method;
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
struct rcu_head rcu;
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
struct tipc_group *group;
|
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 20:00:41 +07:00
|
|
|
u32 oneway;
|
tipc: add test for Nagle algorithm effectiveness
When streaming in Nagle mode, we try to bundle small messages from user
as many as possible if there is one outstanding buffer, i.e. not ACK-ed
by the receiving side, which helps boost up the overall throughput. So,
the algorithm's effectiveness really depends on when Nagle ACK comes or
what the specific network latency (RTT) is, compared to the user's
message sending rate.
In a bad case, the user's sending rate is low or the network latency is
small, there will not be many bundles, so making a Nagle ACK or waiting
for it is not meaningful.
For example: a user sends its messages every 100ms and the RTT is 50ms,
then for each messages, we require one Nagle ACK but then there is only
one user message sent without any bundles.
In a better case, even if we have a few bundles (e.g. the RTT = 300ms),
but now the user sends messages in medium size, then there will not be
any difference at all, that says 3 x 1000-byte data messages if bundled
will still result in 3 bundles with MTU = 1500.
When Nagle is ineffective, the delay in user message sending is clearly
wasted instead of sending directly.
Besides, adding Nagle ACKs will consume some processor load on both the
sending and receiving sides.
This commit adds a test on the effectiveness of the Nagle algorithm for
an individual connection in the network on which it actually runs.
Particularly, upon receipt of a Nagle ACK we will compare the number of
bundles in the backlog queue to the number of user messages which would
be sent directly without Nagle. If the ratio is good (e.g. >= 2), Nagle
mode will be kept for further message sending. Otherwise, we will leave
Nagle and put a 'penalty' on the connection, so it will have to spend
more 'one-way' messages before being able to re-enter Nagle.
In addition, the 'ack-required' bit is only set when really needed that
the number of Nagle ACKs will be reduced during Nagle mode.
Testing with benchmark showed that with the patch, there was not much
difference in throughput for small messages since the tool continuously
sends messages without a break, so Nagle would still take in effect.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 16:38:38 +07:00
|
|
|
u32 nagle_start;
|
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 20:00:41 +07:00
|
|
|
u16 snd_backlog;
|
tipc: add test for Nagle algorithm effectiveness
When streaming in Nagle mode, we try to bundle small messages from user
as many as possible if there is one outstanding buffer, i.e. not ACK-ed
by the receiving side, which helps boost up the overall throughput. So,
the algorithm's effectiveness really depends on when Nagle ACK comes or
what the specific network latency (RTT) is, compared to the user's
message sending rate.
In a bad case, the user's sending rate is low or the network latency is
small, there will not be many bundles, so making a Nagle ACK or waiting
for it is not meaningful.
For example: a user sends its messages every 100ms and the RTT is 50ms,
then for each messages, we require one Nagle ACK but then there is only
one user message sent without any bundles.
In a better case, even if we have a few bundles (e.g. the RTT = 300ms),
but now the user sends messages in medium size, then there will not be
any difference at all, that says 3 x 1000-byte data messages if bundled
will still result in 3 bundles with MTU = 1500.
When Nagle is ineffective, the delay in user message sending is clearly
wasted instead of sending directly.
Besides, adding Nagle ACKs will consume some processor load on both the
sending and receiving sides.
This commit adds a test on the effectiveness of the Nagle algorithm for
an individual connection in the network on which it actually runs.
Particularly, upon receipt of a Nagle ACK we will compare the number of
bundles in the backlog queue to the number of user messages which would
be sent directly without Nagle. If the ratio is good (e.g. >= 2), Nagle
mode will be kept for further message sending. Otherwise, we will leave
Nagle and put a 'penalty' on the connection, so it will have to spend
more 'one-way' messages before being able to re-enter Nagle.
In addition, the 'ack-required' bit is only set when really needed that
the number of Nagle ACKs will be reduced during Nagle mode.
Testing with benchmark showed that with the patch, there was not much
difference in throughput for small messages since the tool continuously
sends messages without a break, so Nagle would still take in effect.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 16:38:38 +07:00
|
|
|
u16 msg_acc;
|
|
|
|
u16 pkt_cnt;
|
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 20:00:41 +07:00
|
|
|
bool expect_ack;
|
|
|
|
bool nodelay;
|
2018-01-17 22:42:46 +07:00
|
|
|
bool group_is_open;
|
2014-08-23 05:09:20 +07:00
|
|
|
};
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2017-10-13 16:04:20 +07:00
|
|
|
static int tipc_sk_backlog_rcv(struct sock *sk, struct sk_buff *skb);
|
2014-04-12 03:15:36 +07:00
|
|
|
static void tipc_data_ready(struct sock *sk);
|
2012-08-21 10:16:57 +07:00
|
|
|
static void tipc_write_space(struct sock *sk);
|
2015-11-22 14:46:05 +07:00
|
|
|
static void tipc_sock_destruct(struct sock *sk);
|
2014-02-18 15:06:46 +07:00
|
|
|
static int tipc_release(struct socket *sock);
|
2017-03-09 15:09:05 +07:00
|
|
|
static int tipc_accept(struct socket *sock, struct socket *new_sock, int flags,
|
|
|
|
bool kern);
|
2017-10-31 04:06:45 +07:00
|
|
|
static void tipc_sk_timeout(struct timer_list *t);
|
2014-08-23 05:09:20 +07:00
|
|
|
static int tipc_sk_publish(struct tipc_sock *tsk, uint scope,
|
2014-08-23 05:09:17 +07:00
|
|
|
struct tipc_name_seq const *seq);
|
2014-08-23 05:09:20 +07:00
|
|
|
static int tipc_sk_withdraw(struct tipc_sock *tsk, uint scope,
|
2014-08-23 05:09:17 +07:00
|
|
|
struct tipc_name_seq const *seq);
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
static int tipc_sk_leave(struct tipc_sock *tsk);
|
2015-01-09 14:27:08 +07:00
|
|
|
static struct tipc_sock *tipc_sk_lookup(struct net *net, u32 portid);
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
static int tipc_sk_insert(struct tipc_sock *tsk);
|
|
|
|
static void tipc_sk_remove(struct tipc_sock *tsk);
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
static int __tipc_sendstream(struct socket *sock, struct msghdr *m, size_t dsz);
|
2015-03-02 14:37:47 +07:00
|
|
|
static int __tipc_sendmsg(struct socket *sock, struct msghdr *m, size_t dsz);
|
tipc: add test for Nagle algorithm effectiveness
When streaming in Nagle mode, we try to bundle small messages from user
as many as possible if there is one outstanding buffer, i.e. not ACK-ed
by the receiving side, which helps boost up the overall throughput. So,
the algorithm's effectiveness really depends on when Nagle ACK comes or
what the specific network latency (RTT) is, compared to the user's
message sending rate.
In a bad case, the user's sending rate is low or the network latency is
small, there will not be many bundles, so making a Nagle ACK or waiting
for it is not meaningful.
For example: a user sends its messages every 100ms and the RTT is 50ms,
then for each messages, we require one Nagle ACK but then there is only
one user message sent without any bundles.
In a better case, even if we have a few bundles (e.g. the RTT = 300ms),
but now the user sends messages in medium size, then there will not be
any difference at all, that says 3 x 1000-byte data messages if bundled
will still result in 3 bundles with MTU = 1500.
When Nagle is ineffective, the delay in user message sending is clearly
wasted instead of sending directly.
Besides, adding Nagle ACKs will consume some processor load on both the
sending and receiving sides.
This commit adds a test on the effectiveness of the Nagle algorithm for
an individual connection in the network on which it actually runs.
Particularly, upon receipt of a Nagle ACK we will compare the number of
bundles in the backlog queue to the number of user messages which would
be sent directly without Nagle. If the ratio is good (e.g. >= 2), Nagle
mode will be kept for further message sending. Otherwise, we will leave
Nagle and put a 'penalty' on the connection, so it will have to spend
more 'one-way' messages before being able to re-enter Nagle.
In addition, the 'ack-required' bit is only set when really needed that
the number of Nagle ACKs will be reduced during Nagle mode.
Testing with benchmark showed that with the patch, there was not much
difference in throughput for small messages since the tool continuously
sends messages without a break, so Nagle would still take in effect.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 16:38:38 +07:00
|
|
|
static void tipc_sk_push_backlog(struct tipc_sock *tsk, bool nagle_ack);
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2008-02-08 09:18:01 +07:00
|
|
|
static const struct proto_ops packet_ops;
|
|
|
|
static const struct proto_ops stream_ops;
|
|
|
|
static const struct proto_ops msg_ops;
|
2006-01-03 01:04:38 +07:00
|
|
|
static struct proto tipc_proto;
|
2015-03-20 17:57:05 +07:00
|
|
|
static const struct rhashtable_params tsk_rht_params;
|
|
|
|
|
2015-02-05 20:36:36 +07:00
|
|
|
static u32 tsk_own_node(struct tipc_sock *tsk)
|
|
|
|
{
|
|
|
|
return msg_prevnode(&tsk->phdr);
|
|
|
|
}
|
|
|
|
|
2014-08-23 05:09:20 +07:00
|
|
|
static u32 tsk_peer_node(struct tipc_sock *tsk)
|
2014-08-23 05:09:18 +07:00
|
|
|
{
|
2014-08-23 05:09:20 +07:00
|
|
|
return msg_destnode(&tsk->phdr);
|
2014-08-23 05:09:18 +07:00
|
|
|
}
|
|
|
|
|
2014-08-23 05:09:20 +07:00
|
|
|
static u32 tsk_peer_port(struct tipc_sock *tsk)
|
2014-08-23 05:09:18 +07:00
|
|
|
{
|
2014-08-23 05:09:20 +07:00
|
|
|
return msg_destport(&tsk->phdr);
|
2014-08-23 05:09:18 +07:00
|
|
|
}
|
|
|
|
|
2014-08-23 05:09:20 +07:00
|
|
|
static bool tsk_unreliable(struct tipc_sock *tsk)
|
2014-08-23 05:09:18 +07:00
|
|
|
{
|
2014-08-23 05:09:20 +07:00
|
|
|
return msg_src_droppable(&tsk->phdr) != 0;
|
2014-08-23 05:09:18 +07:00
|
|
|
}
|
|
|
|
|
2014-08-23 05:09:20 +07:00
|
|
|
static void tsk_set_unreliable(struct tipc_sock *tsk, bool unreliable)
|
2014-08-23 05:09:18 +07:00
|
|
|
{
|
2014-08-23 05:09:20 +07:00
|
|
|
msg_set_src_droppable(&tsk->phdr, unreliable ? 1 : 0);
|
2014-08-23 05:09:18 +07:00
|
|
|
}
|
|
|
|
|
2014-08-23 05:09:20 +07:00
|
|
|
static bool tsk_unreturnable(struct tipc_sock *tsk)
|
2014-08-23 05:09:18 +07:00
|
|
|
{
|
2014-08-23 05:09:20 +07:00
|
|
|
return msg_dest_droppable(&tsk->phdr) != 0;
|
2014-08-23 05:09:18 +07:00
|
|
|
}
|
|
|
|
|
2014-08-23 05:09:20 +07:00
|
|
|
static void tsk_set_unreturnable(struct tipc_sock *tsk, bool unreturnable)
|
2014-08-23 05:09:18 +07:00
|
|
|
{
|
2014-08-23 05:09:20 +07:00
|
|
|
msg_set_dest_droppable(&tsk->phdr, unreturnable ? 1 : 0);
|
2014-08-23 05:09:18 +07:00
|
|
|
}
|
|
|
|
|
2014-08-23 05:09:20 +07:00
|
|
|
static int tsk_importance(struct tipc_sock *tsk)
|
2014-08-23 05:09:18 +07:00
|
|
|
{
|
2014-08-23 05:09:20 +07:00
|
|
|
return msg_importance(&tsk->phdr);
|
2014-08-23 05:09:18 +07:00
|
|
|
}
|
|
|
|
|
2020-05-28 12:12:36 +07:00
|
|
|
static struct tipc_sock *tipc_sk(const struct sock *sk)
|
2014-08-23 05:09:18 +07:00
|
|
|
{
|
2020-05-28 12:12:36 +07:00
|
|
|
return container_of(sk, struct tipc_sock, sk);
|
2014-08-23 05:09:18 +07:00
|
|
|
}
|
2014-03-12 22:31:09 +07:00
|
|
|
|
2020-05-28 12:12:36 +07:00
|
|
|
int tsk_set_importance(struct sock *sk, int imp)
|
2014-08-23 05:09:20 +07:00
|
|
|
{
|
2020-05-28 12:12:36 +07:00
|
|
|
if (imp > TIPC_CRITICAL_IMPORTANCE)
|
|
|
|
return -EINVAL;
|
|
|
|
msg_set_importance(&tipc_sk(sk)->phdr, (u32)imp);
|
|
|
|
return 0;
|
2014-08-23 05:09:20 +07:00
|
|
|
}
|
|
|
|
|
tipc: redesign connection-level flow control
There are two flow control mechanisms in TIPC; one at link level that
handles network congestion, burst control, and retransmission, and one
at connection level which' only remaining task is to prevent overflow
in the receiving socket buffer. In TIPC, the latter task has to be
solved end-to-end because messages can not be thrown away once they
have been accepted and delivered upwards from the link layer, i.e, we
can never permit the receive buffer to overflow.
Currently, this algorithm is message based. A counter in the receiving
socket keeps track of number of consumed messages, and sends a dedicated
acknowledge message back to the sender for each 256 consumed message.
A counter at the sending end keeps track of the sent, not yet
acknowledged messages, and blocks the sender if this number ever reaches
512 unacknowledged messages. When the missing acknowledge arrives, the
socket is then woken up for renewed transmission. This works well for
keeping the message flow running, as it almost never happens that a
sender socket is blocked this way.
A problem with the current mechanism is that it potentially is very
memory consuming. Since we don't distinguish between small and large
messages, we have to dimension the socket receive buffer according
to a worst-case of both. I.e., the window size must be chosen large
enough to sustain a reasonable throughput even for the smallest
messages, while we must still consider a scenario where all messages
are of maximum size. Hence, the current fix window size of 512 messages
and a maximum message size of 66k results in a receive buffer of 66 MB
when truesize(66k) = 131k is taken into account. It is possible to do
much better.
This commit introduces an algorithm where we instead use 1024-byte
blocks as base unit. This unit, always rounded upwards from the
actual message size, is used when we advertise windows as well as when
we count and acknowledge transmitted data. The advertised window is
based on the configured receive buffer size in such a way that even
the worst-case truesize/msgsize ratio always is covered. Since the
smallest possible message size (from a flow control viewpoint) now is
1024 bytes, we can safely assume this ratio to be less than four, which
is the value we are now using.
This way, we have been able to reduce the default receive buffer size
from 66 MB to 2 MB with maintained performance.
In order to keep this solution backwards compatible, we introduce a
new capability bit in the discovery protocol, and use this throughout
the message sending/reception path to always select the right unit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 22:58:47 +07:00
|
|
|
static bool tsk_conn_cong(struct tipc_sock *tsk)
|
2014-08-23 05:09:20 +07:00
|
|
|
{
|
tipc: resolve connection flow control compatibility problem
In commit 10724cc7bb78 ("tipc: redesign connection-level flow control")
we replaced the previous message based flow control with one based on
1k blocks. In order to ensure backwards compatibility the mechanism
falls back to using message as base unit when it senses that the peer
doesn't support the new algorithm. The default flow control window,
i.e., how many units can be sent before the sender blocks and waits
for an acknowledge (aka advertisement) is 512. This was tested against
the previous version, which uses an acknowledge frequency of on ack per
256 received message, and found to work fine.
However, we missed the fact that versions older than Linux 3.15 use an
acknowledge frequency of 512, which is exactly the limit where a 4.6+
sender will stop and wait for acknowledge. This would also work fine if
it weren't for the fact that if the first sent message on a 4.6+ server
side is an empty SYNACK, this one is also is counted as a sent message,
while it is not counted as a received message on a legacy 3.15-receiver.
This leads to the sender always being one step ahead of the receiver, a
scenario causing the sender to block after 512 sent messages, while the
receiver only has registered 511 read messages. Hence, the legacy
receiver is not trigged to send an acknowledge, with a permanently
blocked sender as result.
We solve this deadlock by simply allowing the sender to send one more
message before it blocks, i.e., by a making minimal change to the
condition used for determining connection congestion.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25 06:47:07 +07:00
|
|
|
return tsk->snt_unacked > tsk->snd_win;
|
tipc: redesign connection-level flow control
There are two flow control mechanisms in TIPC; one at link level that
handles network congestion, burst control, and retransmission, and one
at connection level which' only remaining task is to prevent overflow
in the receiving socket buffer. In TIPC, the latter task has to be
solved end-to-end because messages can not be thrown away once they
have been accepted and delivered upwards from the link layer, i.e, we
can never permit the receive buffer to overflow.
Currently, this algorithm is message based. A counter in the receiving
socket keeps track of number of consumed messages, and sends a dedicated
acknowledge message back to the sender for each 256 consumed message.
A counter at the sending end keeps track of the sent, not yet
acknowledged messages, and blocks the sender if this number ever reaches
512 unacknowledged messages. When the missing acknowledge arrives, the
socket is then woken up for renewed transmission. This works well for
keeping the message flow running, as it almost never happens that a
sender socket is blocked this way.
A problem with the current mechanism is that it potentially is very
memory consuming. Since we don't distinguish between small and large
messages, we have to dimension the socket receive buffer according
to a worst-case of both. I.e., the window size must be chosen large
enough to sustain a reasonable throughput even for the smallest
messages, while we must still consider a scenario where all messages
are of maximum size. Hence, the current fix window size of 512 messages
and a maximum message size of 66k results in a receive buffer of 66 MB
when truesize(66k) = 131k is taken into account. It is possible to do
much better.
This commit introduces an algorithm where we instead use 1024-byte
blocks as base unit. This unit, always rounded upwards from the
actual message size, is used when we advertise windows as well as when
we count and acknowledge transmitted data. The advertised window is
based on the configured receive buffer size in such a way that even
the worst-case truesize/msgsize ratio always is covered. Since the
smallest possible message size (from a flow control viewpoint) now is
1024 bytes, we can safely assume this ratio to be less than four, which
is the value we are now using.
This way, we have been able to reduce the default receive buffer size
from 66 MB to 2 MB with maintained performance.
In order to keep this solution backwards compatible, we introduce a
new capability bit in the discovery protocol, and use this throughout
the message sending/reception path to always select the right unit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 22:58:47 +07:00
|
|
|
}
|
|
|
|
|
2017-10-13 16:04:26 +07:00
|
|
|
static u16 tsk_blocks(int len)
|
|
|
|
{
|
|
|
|
return ((len / FLOWCTL_BLK_SZ) + 1);
|
|
|
|
}
|
|
|
|
|
tipc: redesign connection-level flow control
There are two flow control mechanisms in TIPC; one at link level that
handles network congestion, burst control, and retransmission, and one
at connection level which' only remaining task is to prevent overflow
in the receiving socket buffer. In TIPC, the latter task has to be
solved end-to-end because messages can not be thrown away once they
have been accepted and delivered upwards from the link layer, i.e, we
can never permit the receive buffer to overflow.
Currently, this algorithm is message based. A counter in the receiving
socket keeps track of number of consumed messages, and sends a dedicated
acknowledge message back to the sender for each 256 consumed message.
A counter at the sending end keeps track of the sent, not yet
acknowledged messages, and blocks the sender if this number ever reaches
512 unacknowledged messages. When the missing acknowledge arrives, the
socket is then woken up for renewed transmission. This works well for
keeping the message flow running, as it almost never happens that a
sender socket is blocked this way.
A problem with the current mechanism is that it potentially is very
memory consuming. Since we don't distinguish between small and large
messages, we have to dimension the socket receive buffer according
to a worst-case of both. I.e., the window size must be chosen large
enough to sustain a reasonable throughput even for the smallest
messages, while we must still consider a scenario where all messages
are of maximum size. Hence, the current fix window size of 512 messages
and a maximum message size of 66k results in a receive buffer of 66 MB
when truesize(66k) = 131k is taken into account. It is possible to do
much better.
This commit introduces an algorithm where we instead use 1024-byte
blocks as base unit. This unit, always rounded upwards from the
actual message size, is used when we advertise windows as well as when
we count and acknowledge transmitted data. The advertised window is
based on the configured receive buffer size in such a way that even
the worst-case truesize/msgsize ratio always is covered. Since the
smallest possible message size (from a flow control viewpoint) now is
1024 bytes, we can safely assume this ratio to be less than four, which
is the value we are now using.
This way, we have been able to reduce the default receive buffer size
from 66 MB to 2 MB with maintained performance.
In order to keep this solution backwards compatible, we introduce a
new capability bit in the discovery protocol, and use this throughout
the message sending/reception path to always select the right unit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 22:58:47 +07:00
|
|
|
/* tsk_blocks(): translate a buffer size in bytes to number of
|
|
|
|
* advertisable blocks, taking into account the ratio truesize(len)/len
|
|
|
|
* We can trust that this ratio is always < 4 for len >= FLOWCTL_BLK_SZ
|
|
|
|
*/
|
|
|
|
static u16 tsk_adv_blocks(int len)
|
|
|
|
{
|
|
|
|
return len / FLOWCTL_BLK_SZ / 4;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* tsk_inc(): increment counter for sent or received data
|
|
|
|
* - If block based flow control is not supported by peer we
|
|
|
|
* fall back to message based ditto, incrementing the counter
|
|
|
|
*/
|
|
|
|
static u16 tsk_inc(struct tipc_sock *tsk, int msglen)
|
|
|
|
{
|
|
|
|
if (likely(tsk->peer_caps & TIPC_BLOCK_FLOWCTL))
|
|
|
|
return ((msglen / FLOWCTL_BLK_SZ) + 1);
|
|
|
|
return 1;
|
2014-08-23 05:09:20 +07:00
|
|
|
}
|
|
|
|
|
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 20:00:41 +07:00
|
|
|
/* tsk_set_nagle - enable/disable nagle property by manipulating maxnagle
|
|
|
|
*/
|
|
|
|
static void tsk_set_nagle(struct tipc_sock *tsk)
|
|
|
|
{
|
|
|
|
struct sock *sk = &tsk->sk;
|
|
|
|
|
|
|
|
tsk->maxnagle = 0;
|
|
|
|
if (sk->sk_type != SOCK_STREAM)
|
|
|
|
return;
|
|
|
|
if (tsk->nodelay)
|
|
|
|
return;
|
|
|
|
if (!(tsk->peer_caps & TIPC_NAGLE))
|
|
|
|
return;
|
|
|
|
/* Limit node local buffer size to avoid receive queue overflow */
|
|
|
|
if (tsk->max_pkt == MAX_MSG_SIZE)
|
|
|
|
tsk->maxnagle = 1500;
|
|
|
|
else
|
|
|
|
tsk->maxnagle = tsk->max_pkt;
|
|
|
|
}
|
|
|
|
|
2008-04-15 14:22:02 +07:00
|
|
|
/**
|
2014-08-23 05:09:18 +07:00
|
|
|
* tsk_advance_rx_queue - discard first buffer in socket receive queue
|
2008-04-15 14:22:02 +07:00
|
|
|
*
|
|
|
|
* Caller must hold socket lock
|
2006-01-03 01:04:38 +07:00
|
|
|
*/
|
2014-08-23 05:09:18 +07:00
|
|
|
static void tsk_advance_rx_queue(struct sock *sk)
|
2006-01-03 01:04:38 +07:00
|
|
|
{
|
tipc: add trace_events for tipc socket
The commit adds the new trace_events for TIPC socket object:
trace_tipc_sk_create()
trace_tipc_sk_poll()
trace_tipc_sk_sendmsg()
trace_tipc_sk_sendmcast()
trace_tipc_sk_sendstream()
trace_tipc_sk_filter_rcv()
trace_tipc_sk_advance_rx()
trace_tipc_sk_rej_msg()
trace_tipc_sk_drop_msg()
trace_tipc_sk_release()
trace_tipc_sk_shutdown()
trace_tipc_sk_overlimit1()
trace_tipc_sk_overlimit2()
Also, enables the traces for the following cases:
- When user creates a TIPC socket;
- When user calls poll() on TIPC socket;
- When user sends a dgram/mcast/stream message.
- When a message is put into the socket 'sk_receive_queue';
- When a message is released from the socket 'sk_receive_queue';
- When a message is rejected (e.g. due to no port, invalid, etc.);
- When a message is dropped (e.g. due to wrong message type);
- When socket is released;
- When socket is shutdown;
- When socket rcvq's allocation is overlimit (> 90%);
- When socket rcvq + bklq's allocation is overlimit (> 90%);
- When the 'TIPC_ERR_OVERLOAD/2' issue happens;
Note:
a) All the socket traces are designed to be able to trace on a specific
socket by either using the 'event filtering' feature on a known socket
'portid' value or the sysctl file:
/proc/sys/net/tipc/sk_filter
The file determines a 'tuple' for what socket should be traced:
(portid, sock type, name type, name lower, name upper)
where:
+ 'portid' is the socket portid generated at socket creating, can be
found in the trace outputs or the 'tipc socket list' command printouts;
+ 'sock type' is the socket type (1 = SOCK_TREAM, ...);
+ 'name type', 'name lower' and 'name upper' are the service name being
connected to or published by the socket.
Value '0' means 'ANY', the default tuple value is (0, 0, 0, 0, 0) i.e.
the traces happen for every sockets with no filter.
b) The 'tipc_sk_overlimit1/2' event is also a conditional trace_event
which happens when the socket receive queue (and backlog queue) is
about to be overloaded, when the queue allocation is > 90%. Then, when
the trace is enabled, the last skbs leading to the TIPC_ERR_OVERLOAD/2
issue can be traced.
The trace event is designed as an 'upper watermark' notification that
the other traces (e.g. 'tipc_sk_advance_rx' vs 'tipc_sk_filter_rcv') or
actions can be triggerred in the meanwhile to see what is going on with
the socket queue.
In addition, the 'trace_tipc_sk_dump()' is also placed at the
'TIPC_ERR_OVERLOAD/2' case, so the socket and last skb can be dumped
for post-analysis.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 09:17:58 +07:00
|
|
|
trace_tipc_sk_advance_rx(sk, NULL, TIPC_DUMP_SK_RCVQ, " ");
|
2011-11-05 00:24:29 +07:00
|
|
|
kfree_skb(__skb_dequeue(&sk->sk_receive_queue));
|
2006-01-03 01:04:38 +07:00
|
|
|
}
|
|
|
|
|
tipc: introduce new tipc_sk_respond() function
Currently, we use the code sequence
if (msg_reverse())
tipc_link_xmit_skb()
at numerous locations in socket.c. The preparation of arguments
for these calls, as well as the sequence itself, makes the code
unecessarily complex.
In this commit, we introduce a new function, tipc_sk_respond(),
that performs this call combination. We also replace some, but not
yet all, of these explicit call sequences with calls to the new
function. Notably, we let the function tipc_sk_proto_rcv() use
the new function to directly send out PROBE_REPLY messages,
instead of deferring this to the calling tipc_sk_rcv() function,
as we do now.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-22 21:11:19 +07:00
|
|
|
/* tipc_sk_respond() : send response message back to sender
|
|
|
|
*/
|
|
|
|
static void tipc_sk_respond(struct sock *sk, struct sk_buff *skb, int err)
|
|
|
|
{
|
|
|
|
u32 selector;
|
|
|
|
u32 dnode;
|
|
|
|
u32 onode = tipc_own_addr(sock_net(sk));
|
|
|
|
|
|
|
|
if (!tipc_msg_reverse(onode, &skb, err))
|
|
|
|
return;
|
|
|
|
|
tipc: add trace_events for tipc socket
The commit adds the new trace_events for TIPC socket object:
trace_tipc_sk_create()
trace_tipc_sk_poll()
trace_tipc_sk_sendmsg()
trace_tipc_sk_sendmcast()
trace_tipc_sk_sendstream()
trace_tipc_sk_filter_rcv()
trace_tipc_sk_advance_rx()
trace_tipc_sk_rej_msg()
trace_tipc_sk_drop_msg()
trace_tipc_sk_release()
trace_tipc_sk_shutdown()
trace_tipc_sk_overlimit1()
trace_tipc_sk_overlimit2()
Also, enables the traces for the following cases:
- When user creates a TIPC socket;
- When user calls poll() on TIPC socket;
- When user sends a dgram/mcast/stream message.
- When a message is put into the socket 'sk_receive_queue';
- When a message is released from the socket 'sk_receive_queue';
- When a message is rejected (e.g. due to no port, invalid, etc.);
- When a message is dropped (e.g. due to wrong message type);
- When socket is released;
- When socket is shutdown;
- When socket rcvq's allocation is overlimit (> 90%);
- When socket rcvq + bklq's allocation is overlimit (> 90%);
- When the 'TIPC_ERR_OVERLOAD/2' issue happens;
Note:
a) All the socket traces are designed to be able to trace on a specific
socket by either using the 'event filtering' feature on a known socket
'portid' value or the sysctl file:
/proc/sys/net/tipc/sk_filter
The file determines a 'tuple' for what socket should be traced:
(portid, sock type, name type, name lower, name upper)
where:
+ 'portid' is the socket portid generated at socket creating, can be
found in the trace outputs or the 'tipc socket list' command printouts;
+ 'sock type' is the socket type (1 = SOCK_TREAM, ...);
+ 'name type', 'name lower' and 'name upper' are the service name being
connected to or published by the socket.
Value '0' means 'ANY', the default tuple value is (0, 0, 0, 0, 0) i.e.
the traces happen for every sockets with no filter.
b) The 'tipc_sk_overlimit1/2' event is also a conditional trace_event
which happens when the socket receive queue (and backlog queue) is
about to be overloaded, when the queue allocation is > 90%. Then, when
the trace is enabled, the last skbs leading to the TIPC_ERR_OVERLOAD/2
issue can be traced.
The trace event is designed as an 'upper watermark' notification that
the other traces (e.g. 'tipc_sk_advance_rx' vs 'tipc_sk_filter_rcv') or
actions can be triggerred in the meanwhile to see what is going on with
the socket queue.
In addition, the 'trace_tipc_sk_dump()' is also placed at the
'TIPC_ERR_OVERLOAD/2' case, so the socket and last skb can be dumped
for post-analysis.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 09:17:58 +07:00
|
|
|
trace_tipc_sk_rej_msg(sk, skb, TIPC_DUMP_NONE, "@sk_respond!");
|
tipc: introduce new tipc_sk_respond() function
Currently, we use the code sequence
if (msg_reverse())
tipc_link_xmit_skb()
at numerous locations in socket.c. The preparation of arguments
for these calls, as well as the sequence itself, makes the code
unecessarily complex.
In this commit, we introduce a new function, tipc_sk_respond(),
that performs this call combination. We also replace some, but not
yet all, of these explicit call sequences with calls to the new
function. Notably, we let the function tipc_sk_proto_rcv() use
the new function to directly send out PROBE_REPLY messages,
instead of deferring this to the calling tipc_sk_rcv() function,
as we do now.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-22 21:11:19 +07:00
|
|
|
dnode = msg_destnode(buf_msg(skb));
|
|
|
|
selector = msg_origport(buf_msg(skb));
|
|
|
|
tipc_node_xmit_skb(sock_net(sk), skb, dnode, selector);
|
|
|
|
}
|
|
|
|
|
2006-01-03 01:04:38 +07:00
|
|
|
/**
|
2014-08-23 05:09:18 +07:00
|
|
|
* tsk_rej_rx_queue - reject all buffers in socket receive queue
|
2008-04-15 14:22:02 +07:00
|
|
|
*
|
|
|
|
* Caller must hold socket lock
|
2006-01-03 01:04:38 +07:00
|
|
|
*/
|
2020-01-08 09:18:15 +07:00
|
|
|
static void tsk_rej_rx_queue(struct sock *sk, int error)
|
2006-01-03 01:04:38 +07:00
|
|
|
{
|
2014-11-26 10:41:55 +07:00
|
|
|
struct sk_buff *skb;
|
2008-04-15 14:22:02 +07:00
|
|
|
|
tipc: introduce new tipc_sk_respond() function
Currently, we use the code sequence
if (msg_reverse())
tipc_link_xmit_skb()
at numerous locations in socket.c. The preparation of arguments
for these calls, as well as the sequence itself, makes the code
unecessarily complex.
In this commit, we introduce a new function, tipc_sk_respond(),
that performs this call combination. We also replace some, but not
yet all, of these explicit call sequences with calls to the new
function. Notably, we let the function tipc_sk_proto_rcv() use
the new function to directly send out PROBE_REPLY messages,
instead of deferring this to the calling tipc_sk_rcv() function,
as we do now.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-22 21:11:19 +07:00
|
|
|
while ((skb = __skb_dequeue(&sk->sk_receive_queue)))
|
2020-01-08 09:18:15 +07:00
|
|
|
tipc_sk_respond(sk, skb, error);
|
2006-01-03 01:04:38 +07:00
|
|
|
}
|
|
|
|
|
2016-11-01 20:02:40 +07:00
|
|
|
static bool tipc_sk_connected(struct sock *sk)
|
|
|
|
{
|
2016-11-01 20:02:49 +07:00
|
|
|
return sk->sk_state == TIPC_ESTABLISHED;
|
2016-11-01 20:02:40 +07:00
|
|
|
}
|
|
|
|
|
2016-11-01 20:02:42 +07:00
|
|
|
/* tipc_sk_type_connectionless - check if the socket is datagram socket
|
|
|
|
* @sk: socket
|
|
|
|
*
|
|
|
|
* Returns true if connection less, false otherwise
|
|
|
|
*/
|
|
|
|
static bool tipc_sk_type_connectionless(struct sock *sk)
|
|
|
|
{
|
|
|
|
return sk->sk_type == SOCK_RDM || sk->sk_type == SOCK_DGRAM;
|
|
|
|
}
|
|
|
|
|
2014-08-23 05:09:18 +07:00
|
|
|
/* tsk_peer_msg - verify if message was sent by connected port's peer
|
2014-08-23 05:09:17 +07:00
|
|
|
*
|
|
|
|
* Handles cases where the node's network address has changed from
|
|
|
|
* the default of <0.0.0> to its configured setting.
|
|
|
|
*/
|
2014-08-23 05:09:18 +07:00
|
|
|
static bool tsk_peer_msg(struct tipc_sock *tsk, struct tipc_msg *msg)
|
2014-08-23 05:09:17 +07:00
|
|
|
{
|
2016-11-01 20:02:40 +07:00
|
|
|
struct sock *sk = &tsk->sk;
|
2018-03-23 02:42:49 +07:00
|
|
|
u32 self = tipc_own_addr(sock_net(sk));
|
2014-08-23 05:09:20 +07:00
|
|
|
u32 peer_port = tsk_peer_port(tsk);
|
2018-03-23 02:42:49 +07:00
|
|
|
u32 orig_node, peer_node;
|
2014-08-23 05:09:17 +07:00
|
|
|
|
2016-11-01 20:02:40 +07:00
|
|
|
if (unlikely(!tipc_sk_connected(sk)))
|
2014-08-23 05:09:17 +07:00
|
|
|
return false;
|
|
|
|
|
|
|
|
if (unlikely(msg_origport(msg) != peer_port))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
orig_node = msg_orignode(msg);
|
2014-08-23 05:09:20 +07:00
|
|
|
peer_node = tsk_peer_node(tsk);
|
2014-08-23 05:09:17 +07:00
|
|
|
|
|
|
|
if (likely(orig_node == peer_node))
|
|
|
|
return true;
|
|
|
|
|
2018-03-23 02:42:49 +07:00
|
|
|
if (!orig_node && peer_node == self)
|
2014-08-23 05:09:17 +07:00
|
|
|
return true;
|
|
|
|
|
2018-03-23 02:42:49 +07:00
|
|
|
if (!peer_node && orig_node == self)
|
2014-08-23 05:09:17 +07:00
|
|
|
return true;
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2016-11-01 20:02:43 +07:00
|
|
|
/* tipc_set_sk_state - set the sk_state of the socket
|
|
|
|
* @sk: socket
|
|
|
|
*
|
|
|
|
* Caller must hold socket lock
|
|
|
|
*
|
|
|
|
* Returns 0 on success, errno otherwise
|
|
|
|
*/
|
|
|
|
static int tipc_set_sk_state(struct sock *sk, int state)
|
|
|
|
{
|
2016-11-01 20:02:45 +07:00
|
|
|
int oldsk_state = sk->sk_state;
|
2016-11-01 20:02:43 +07:00
|
|
|
int res = -EINVAL;
|
|
|
|
|
|
|
|
switch (state) {
|
2016-11-01 20:02:45 +07:00
|
|
|
case TIPC_OPEN:
|
|
|
|
res = 0;
|
|
|
|
break;
|
2016-11-01 20:02:43 +07:00
|
|
|
case TIPC_LISTEN:
|
2016-11-01 20:02:48 +07:00
|
|
|
case TIPC_CONNECTING:
|
2016-11-01 20:02:45 +07:00
|
|
|
if (oldsk_state == TIPC_OPEN)
|
2016-11-01 20:02:43 +07:00
|
|
|
res = 0;
|
|
|
|
break;
|
2016-11-01 20:02:44 +07:00
|
|
|
case TIPC_ESTABLISHED:
|
2016-11-01 20:02:48 +07:00
|
|
|
if (oldsk_state == TIPC_CONNECTING ||
|
2016-11-01 20:02:45 +07:00
|
|
|
oldsk_state == TIPC_OPEN)
|
2016-11-01 20:02:44 +07:00
|
|
|
res = 0;
|
|
|
|
break;
|
2016-11-01 20:02:46 +07:00
|
|
|
case TIPC_DISCONNECTING:
|
2016-11-01 20:02:48 +07:00
|
|
|
if (oldsk_state == TIPC_CONNECTING ||
|
2016-11-01 20:02:46 +07:00
|
|
|
oldsk_state == TIPC_ESTABLISHED)
|
|
|
|
res = 0;
|
|
|
|
break;
|
2016-11-01 20:02:43 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
if (!res)
|
|
|
|
sk->sk_state = state;
|
|
|
|
|
|
|
|
return res;
|
|
|
|
}
|
|
|
|
|
2017-01-03 22:55:09 +07:00
|
|
|
static int tipc_sk_sock_err(struct socket *sock, long *timeout)
|
|
|
|
{
|
|
|
|
struct sock *sk = sock->sk;
|
|
|
|
int err = sock_error(sk);
|
|
|
|
int typ = sock->type;
|
|
|
|
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
if (typ == SOCK_STREAM || typ == SOCK_SEQPACKET) {
|
|
|
|
if (sk->sk_state == TIPC_DISCONNECTING)
|
|
|
|
return -EPIPE;
|
|
|
|
else if (!tipc_sk_connected(sk))
|
|
|
|
return -ENOTCONN;
|
|
|
|
}
|
|
|
|
if (!*timeout)
|
|
|
|
return -EAGAIN;
|
|
|
|
if (signal_pending(current))
|
|
|
|
return sock_intr_errno(*timeout);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2017-05-12 01:28:15 +07:00
|
|
|
#define tipc_wait_for_cond(sock_, timeo_, condition_) \
|
|
|
|
({ \
|
2019-02-25 10:57:20 +07:00
|
|
|
DEFINE_WAIT_FUNC(wait_, woken_wake_function); \
|
2017-05-12 01:28:15 +07:00
|
|
|
struct sock *sk_; \
|
|
|
|
int rc_; \
|
|
|
|
\
|
|
|
|
while ((rc_ = !(condition_))) { \
|
2019-02-25 10:57:20 +07:00
|
|
|
/* coupled with smp_wmb() in tipc_sk_proto_rcv() */ \
|
|
|
|
smp_rmb(); \
|
2017-05-12 01:28:15 +07:00
|
|
|
sk_ = (sock_)->sk; \
|
|
|
|
rc_ = tipc_sk_sock_err((sock_), timeo_); \
|
|
|
|
if (rc_) \
|
|
|
|
break; \
|
2019-02-19 11:20:47 +07:00
|
|
|
add_wait_queue(sk_sleep(sk_), &wait_); \
|
2017-05-12 01:28:15 +07:00
|
|
|
release_sock(sk_); \
|
|
|
|
*(timeo_) = wait_woken(&wait_, TASK_INTERRUPTIBLE, *(timeo_)); \
|
|
|
|
sched_annotate_sleep(); \
|
|
|
|
lock_sock(sk_); \
|
|
|
|
remove_wait_queue(sk_sleep(sk_), &wait_); \
|
|
|
|
} \
|
|
|
|
rc_; \
|
2017-01-03 22:55:09 +07:00
|
|
|
})
|
|
|
|
|
2006-01-03 01:04:38 +07:00
|
|
|
/**
|
tipc: introduce new TIPC server infrastructure
TIPC has two internal servers, one providing a subscription
service for topology events, and another providing the
configuration interface. These servers have previously been running
in BH context, accessing the TIPC-port (aka native) API directly.
Apart from these servers, even the TIPC socket implementation is
partially built on this API.
As this API may simultaneously be called via different paths and in
different contexts, a complex and costly lock policiy is required
in order to protect TIPC internal resources.
To eliminate the need for this complex lock policiy, we introduce
a new, generic service API that uses kernel sockets for message
passing instead of the native API. Once the toplogy and configuration
servers are converted to use this new service, all code pertaining
to the native API can be removed. This entails a significant
reduction in code amount and complexity, and opens up for a complete
rework of the locking policy in TIPC.
The new service also solves another problem:
As the current topology server works in BH context, it cannot easily
be blocked when sending of events fails due to congestion. In such
cases events may have to be silently dropped, something that is
unacceptable. Therefore, the new service keeps a dedicated outbound
queue receiving messages from BH context. Once messages are
inserted into this queue, we will immediately schedule a work from a
special workqueue. This way, messages/events from the topology server
are in reality sent in process context, and the server can block
if necessary.
Analogously, there is a new workqueue for receiving messages. Once a
notification about an arriving message is received in BH context, we
schedule a work from the receive workqueue to do the job of
receiving the message in process context.
As both sending and receive messages are now finished in processes,
subscribed events cannot be dropped any more.
As of this commit, this new server infrastructure is built, but
not actually yet called by the existing TIPC code, but since the
conversion changes required in order to use it are significant,
the addition is kept here as a separate commit.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-17 21:54:39 +07:00
|
|
|
* tipc_sk_create - create a TIPC socket
|
2008-04-15 14:22:02 +07:00
|
|
|
* @net: network namespace (must be default network)
|
2006-01-03 01:04:38 +07:00
|
|
|
* @sock: pre-allocated socket structure
|
|
|
|
* @protocol: protocol indicator (must be 0)
|
2009-11-06 13:18:14 +07:00
|
|
|
* @kern: caused by kernel or by userspace?
|
2007-02-09 21:25:21 +07:00
|
|
|
*
|
2008-04-15 14:22:02 +07:00
|
|
|
* This routine creates additional data structures used by the TIPC socket,
|
|
|
|
* initializes them, and links them together.
|
2006-01-03 01:04:38 +07:00
|
|
|
*
|
|
|
|
* Returns 0 on success, errno otherwise
|
|
|
|
*/
|
2014-03-12 22:31:12 +07:00
|
|
|
static int tipc_sk_create(struct net *net, struct socket *sock,
|
|
|
|
int protocol, int kern)
|
2006-01-03 01:04:38 +07:00
|
|
|
{
|
2008-04-15 14:22:02 +07:00
|
|
|
const struct proto_ops *ops;
|
2006-01-03 01:04:38 +07:00
|
|
|
struct sock *sk;
|
2014-03-12 22:31:12 +07:00
|
|
|
struct tipc_sock *tsk;
|
2014-08-23 05:09:13 +07:00
|
|
|
struct tipc_msg *msg;
|
2008-04-15 14:22:02 +07:00
|
|
|
|
|
|
|
/* Validate arguments */
|
2006-01-03 01:04:38 +07:00
|
|
|
if (unlikely(protocol != 0))
|
|
|
|
return -EPROTONOSUPPORT;
|
|
|
|
|
|
|
|
switch (sock->type) {
|
|
|
|
case SOCK_STREAM:
|
2008-04-15 14:22:02 +07:00
|
|
|
ops = &stream_ops;
|
2006-01-03 01:04:38 +07:00
|
|
|
break;
|
|
|
|
case SOCK_SEQPACKET:
|
2008-04-15 14:22:02 +07:00
|
|
|
ops = &packet_ops;
|
2006-01-03 01:04:38 +07:00
|
|
|
break;
|
|
|
|
case SOCK_DGRAM:
|
|
|
|
case SOCK_RDM:
|
2008-04-15 14:22:02 +07:00
|
|
|
ops = &msg_ops;
|
2006-01-03 01:04:38 +07:00
|
|
|
break;
|
2006-06-26 13:47:18 +07:00
|
|
|
default:
|
|
|
|
return -EPROTOTYPE;
|
2006-01-03 01:04:38 +07:00
|
|
|
}
|
|
|
|
|
2008-04-15 14:22:02 +07:00
|
|
|
/* Allocate socket's protocol area */
|
2015-05-09 09:09:13 +07:00
|
|
|
sk = sk_alloc(net, AF_TIPC, GFP_KERNEL, &tipc_proto, kern);
|
2008-04-15 14:22:02 +07:00
|
|
|
if (sk == NULL)
|
2006-01-03 01:04:38 +07:00
|
|
|
return -ENOMEM;
|
|
|
|
|
2014-03-12 22:31:12 +07:00
|
|
|
tsk = tipc_sk(sk);
|
2014-08-23 05:09:20 +07:00
|
|
|
tsk->max_pkt = MAX_PKT_DEFAULT;
|
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 20:00:41 +07:00
|
|
|
tsk->maxnagle = 0;
|
tipc: add test for Nagle algorithm effectiveness
When streaming in Nagle mode, we try to bundle small messages from user
as many as possible if there is one outstanding buffer, i.e. not ACK-ed
by the receiving side, which helps boost up the overall throughput. So,
the algorithm's effectiveness really depends on when Nagle ACK comes or
what the specific network latency (RTT) is, compared to the user's
message sending rate.
In a bad case, the user's sending rate is low or the network latency is
small, there will not be many bundles, so making a Nagle ACK or waiting
for it is not meaningful.
For example: a user sends its messages every 100ms and the RTT is 50ms,
then for each messages, we require one Nagle ACK but then there is only
one user message sent without any bundles.
In a better case, even if we have a few bundles (e.g. the RTT = 300ms),
but now the user sends messages in medium size, then there will not be
any difference at all, that says 3 x 1000-byte data messages if bundled
will still result in 3 bundles with MTU = 1500.
When Nagle is ineffective, the delay in user message sending is clearly
wasted instead of sending directly.
Besides, adding Nagle ACKs will consume some processor load on both the
sending and receiving sides.
This commit adds a test on the effectiveness of the Nagle algorithm for
an individual connection in the network on which it actually runs.
Particularly, upon receipt of a Nagle ACK we will compare the number of
bundles in the backlog queue to the number of user messages which would
be sent directly without Nagle. If the ratio is good (e.g. >= 2), Nagle
mode will be kept for further message sending. Otherwise, we will leave
Nagle and put a 'penalty' on the connection, so it will have to spend
more 'one-way' messages before being able to re-enter Nagle.
In addition, the 'ack-required' bit is only set when really needed that
the number of Nagle ACKs will be reduced during Nagle mode.
Testing with benchmark showed that with the patch, there was not much
difference in throughput for small messages since the tool continuously
sends messages without a break, so Nagle would still take in effect.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 16:38:38 +07:00
|
|
|
tsk->nagle_start = NAGLE_START_INIT;
|
2014-08-23 05:09:20 +07:00
|
|
|
INIT_LIST_HEAD(&tsk->publications);
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
INIT_LIST_HEAD(&tsk->cong_links);
|
2014-08-23 05:09:20 +07:00
|
|
|
msg = &tsk->phdr;
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2008-04-15 14:22:02 +07:00
|
|
|
/* Finish initializing socket data structures */
|
|
|
|
sock->ops = ops;
|
|
|
|
sock_init_data(sock, sk);
|
2016-11-01 20:02:45 +07:00
|
|
|
tipc_set_sk_state(sk, TIPC_OPEN);
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
if (tipc_sk_insert(tsk)) {
|
2016-02-08 18:53:12 +07:00
|
|
|
pr_warn("Socket create failed; port number exhausted\n");
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
2017-02-11 18:26:46 +07:00
|
|
|
|
|
|
|
/* Ensure tsk is visible before we read own_addr. */
|
|
|
|
smp_mb();
|
|
|
|
|
2018-03-23 02:42:49 +07:00
|
|
|
tipc_msg_init(tipc_own_addr(net), msg, TIPC_LOW_IMPORTANCE,
|
|
|
|
TIPC_NAMED_MSG, NAMED_H_SIZE, 0);
|
2017-02-11 18:26:46 +07:00
|
|
|
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
msg_set_origport(msg, tsk->portid);
|
2017-10-31 04:06:45 +07:00
|
|
|
timer_setup(&sk->sk_timer, tipc_sk_timeout, 0);
|
2016-11-01 20:02:47 +07:00
|
|
|
sk->sk_shutdown = 0;
|
2017-10-13 16:04:20 +07:00
|
|
|
sk->sk_backlog_rcv = tipc_sk_backlog_rcv;
|
2013-06-17 21:54:37 +07:00
|
|
|
sk->sk_rcvbuf = sysctl_tipc_rmem[1];
|
2012-08-21 10:16:57 +07:00
|
|
|
sk->sk_data_ready = tipc_data_ready;
|
|
|
|
sk->sk_write_space = tipc_write_space;
|
2015-11-22 14:46:05 +07:00
|
|
|
sk->sk_destruct = tipc_sock_destruct;
|
tipc: compensate for double accounting in socket rcv buffer
The function net/core/sock.c::__release_sock() runs a tight loop
to move buffers from the socket backlog queue to the receive queue.
As a security measure, sk_backlog.len of the receiving socket
is not set to zero until after the loop is finished, i.e., until
the whole backlog queue has been transferred to the receive queue.
During this transfer, the data that has already been moved is counted
both in the backlog queue and the receive queue, hence giving an
incorrect picture of the available queue space for new arriving buffers.
This leads to unnecessary rejection of buffers by sk_add_backlog(),
which in TIPC leads to unnecessarily broken connections.
In this commit, we compensate for this double accounting by adding
a counter that keeps track of it. The function socket.c::backlog_rcv()
receives buffers one by one from __release_sock(), and adds them to the
socket receive queue. If the transfer is successful, it increases a new
atomic counter 'tipc_sock::dupl_rcvcnt' with 'truesize' of the
transferred buffer. If a new buffer arrives during this transfer and
finds the socket busy (owned), we attempt to add it to the backlog.
However, when sk_add_backlog() is called, we adjust the 'limit'
parameter with the value of the new counter, so that the risk of
inadvertent rejection is eliminated.
It should be noted that this change does not invalidate the original
purpose of zeroing 'sk_backlog.len' after the full transfer. We set an
upper limit for dupl_rcvcnt, so that if a 'wild' sender (i.e., one that
doesn't respect the send window) keeps pumping in buffers to
sk_add_backlog(), he will eventually reach an upper limit,
(2 x TIPC_CONN_OVERLOAD_LIMIT). After that, no messages can be added
to the backlog, and the connection will be broken. Ordinary, well-
behaved senders will never reach this buffer limit at all.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-14 16:39:09 +07:00
|
|
|
tsk->conn_timeout = CONN_TIMEOUT_DEFAULT;
|
2018-02-27 02:14:04 +07:00
|
|
|
tsk->group_is_open = true;
|
tipc: compensate for double accounting in socket rcv buffer
The function net/core/sock.c::__release_sock() runs a tight loop
to move buffers from the socket backlog queue to the receive queue.
As a security measure, sk_backlog.len of the receiving socket
is not set to zero until after the loop is finished, i.e., until
the whole backlog queue has been transferred to the receive queue.
During this transfer, the data that has already been moved is counted
both in the backlog queue and the receive queue, hence giving an
incorrect picture of the available queue space for new arriving buffers.
This leads to unnecessary rejection of buffers by sk_add_backlog(),
which in TIPC leads to unnecessarily broken connections.
In this commit, we compensate for this double accounting by adding
a counter that keeps track of it. The function socket.c::backlog_rcv()
receives buffers one by one from __release_sock(), and adds them to the
socket receive queue. If the transfer is successful, it increases a new
atomic counter 'tipc_sock::dupl_rcvcnt' with 'truesize' of the
transferred buffer. If a new buffer arrives during this transfer and
finds the socket busy (owned), we attempt to add it to the backlog.
However, when sk_add_backlog() is called, we adjust the 'limit'
parameter with the value of the new counter, so that the risk of
inadvertent rejection is eliminated.
It should be noted that this change does not invalidate the original
purpose of zeroing 'sk_backlog.len' after the full transfer. We set an
upper limit for dupl_rcvcnt, so that if a 'wild' sender (i.e., one that
doesn't respect the send window) keeps pumping in buffers to
sk_add_backlog(), he will eventually reach an upper limit,
(2 x TIPC_CONN_OVERLOAD_LIMIT). After that, no messages can be added
to the backlog, and the connection will be broken. Ordinary, well-
behaved senders will never reach this buffer limit at all.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-14 16:39:09 +07:00
|
|
|
atomic_set(&tsk->dupl_rcvcnt, 0);
|
2008-05-13 05:42:28 +07:00
|
|
|
|
tipc: redesign connection-level flow control
There are two flow control mechanisms in TIPC; one at link level that
handles network congestion, burst control, and retransmission, and one
at connection level which' only remaining task is to prevent overflow
in the receiving socket buffer. In TIPC, the latter task has to be
solved end-to-end because messages can not be thrown away once they
have been accepted and delivered upwards from the link layer, i.e, we
can never permit the receive buffer to overflow.
Currently, this algorithm is message based. A counter in the receiving
socket keeps track of number of consumed messages, and sends a dedicated
acknowledge message back to the sender for each 256 consumed message.
A counter at the sending end keeps track of the sent, not yet
acknowledged messages, and blocks the sender if this number ever reaches
512 unacknowledged messages. When the missing acknowledge arrives, the
socket is then woken up for renewed transmission. This works well for
keeping the message flow running, as it almost never happens that a
sender socket is blocked this way.
A problem with the current mechanism is that it potentially is very
memory consuming. Since we don't distinguish between small and large
messages, we have to dimension the socket receive buffer according
to a worst-case of both. I.e., the window size must be chosen large
enough to sustain a reasonable throughput even for the smallest
messages, while we must still consider a scenario where all messages
are of maximum size. Hence, the current fix window size of 512 messages
and a maximum message size of 66k results in a receive buffer of 66 MB
when truesize(66k) = 131k is taken into account. It is possible to do
much better.
This commit introduces an algorithm where we instead use 1024-byte
blocks as base unit. This unit, always rounded upwards from the
actual message size, is used when we advertise windows as well as when
we count and acknowledge transmitted data. The advertised window is
based on the configured receive buffer size in such a way that even
the worst-case truesize/msgsize ratio always is covered. Since the
smallest possible message size (from a flow control viewpoint) now is
1024 bytes, we can safely assume this ratio to be less than four, which
is the value we are now using.
This way, we have been able to reduce the default receive buffer size
from 66 MB to 2 MB with maintained performance.
In order to keep this solution backwards compatible, we introduce a
new capability bit in the discovery protocol, and use this throughout
the message sending/reception path to always select the right unit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 22:58:47 +07:00
|
|
|
/* Start out with safe limits until we receive an advertised window */
|
|
|
|
tsk->snd_win = tsk_adv_blocks(RCVBUF_MIN);
|
|
|
|
tsk->rcv_win = tsk->snd_win;
|
|
|
|
|
2016-11-01 20:02:42 +07:00
|
|
|
if (tipc_sk_type_connectionless(sk)) {
|
2014-08-23 05:09:20 +07:00
|
|
|
tsk_set_unreturnable(tsk, true);
|
2008-04-15 14:22:02 +07:00
|
|
|
if (sock->type == SOCK_DGRAM)
|
2014-08-23 05:09:20 +07:00
|
|
|
tsk_set_unreliable(tsk, true);
|
2008-04-15 14:22:02 +07:00
|
|
|
}
|
2019-07-31 01:19:10 +07:00
|
|
|
__skb_queue_head_init(&tsk->mc_method.deferredq);
|
tipc: add trace_events for tipc socket
The commit adds the new trace_events for TIPC socket object:
trace_tipc_sk_create()
trace_tipc_sk_poll()
trace_tipc_sk_sendmsg()
trace_tipc_sk_sendmcast()
trace_tipc_sk_sendstream()
trace_tipc_sk_filter_rcv()
trace_tipc_sk_advance_rx()
trace_tipc_sk_rej_msg()
trace_tipc_sk_drop_msg()
trace_tipc_sk_release()
trace_tipc_sk_shutdown()
trace_tipc_sk_overlimit1()
trace_tipc_sk_overlimit2()
Also, enables the traces for the following cases:
- When user creates a TIPC socket;
- When user calls poll() on TIPC socket;
- When user sends a dgram/mcast/stream message.
- When a message is put into the socket 'sk_receive_queue';
- When a message is released from the socket 'sk_receive_queue';
- When a message is rejected (e.g. due to no port, invalid, etc.);
- When a message is dropped (e.g. due to wrong message type);
- When socket is released;
- When socket is shutdown;
- When socket rcvq's allocation is overlimit (> 90%);
- When socket rcvq + bklq's allocation is overlimit (> 90%);
- When the 'TIPC_ERR_OVERLOAD/2' issue happens;
Note:
a) All the socket traces are designed to be able to trace on a specific
socket by either using the 'event filtering' feature on a known socket
'portid' value or the sysctl file:
/proc/sys/net/tipc/sk_filter
The file determines a 'tuple' for what socket should be traced:
(portid, sock type, name type, name lower, name upper)
where:
+ 'portid' is the socket portid generated at socket creating, can be
found in the trace outputs or the 'tipc socket list' command printouts;
+ 'sock type' is the socket type (1 = SOCK_TREAM, ...);
+ 'name type', 'name lower' and 'name upper' are the service name being
connected to or published by the socket.
Value '0' means 'ANY', the default tuple value is (0, 0, 0, 0, 0) i.e.
the traces happen for every sockets with no filter.
b) The 'tipc_sk_overlimit1/2' event is also a conditional trace_event
which happens when the socket receive queue (and backlog queue) is
about to be overloaded, when the queue allocation is > 90%. Then, when
the trace is enabled, the last skbs leading to the TIPC_ERR_OVERLOAD/2
issue can be traced.
The trace event is designed as an 'upper watermark' notification that
the other traces (e.g. 'tipc_sk_advance_rx' vs 'tipc_sk_filter_rcv') or
actions can be triggerred in the meanwhile to see what is going on with
the socket queue.
In addition, the 'trace_tipc_sk_dump()' is also placed at the
'TIPC_ERR_OVERLOAD/2' case, so the socket and last skb can be dumped
for post-analysis.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 09:17:58 +07:00
|
|
|
trace_tipc_sk_create(sk, NULL, TIPC_DUMP_NONE, " ");
|
2006-01-03 01:04:38 +07:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
static void tipc_sk_callback(struct rcu_head *head)
|
|
|
|
{
|
|
|
|
struct tipc_sock *tsk = container_of(head, struct tipc_sock, rcu);
|
|
|
|
|
|
|
|
sock_put(&tsk->sk);
|
|
|
|
}
|
|
|
|
|
2016-11-01 20:02:47 +07:00
|
|
|
/* Caller should hold socket lock for the socket. */
|
|
|
|
static void __tipc_shutdown(struct socket *sock, int error)
|
|
|
|
{
|
|
|
|
struct sock *sk = sock->sk;
|
|
|
|
struct tipc_sock *tsk = tipc_sk(sk);
|
|
|
|
struct net *net = sock_net(sk);
|
2019-11-28 10:10:07 +07:00
|
|
|
long timeout = msecs_to_jiffies(CONN_TIMEOUT_DEFAULT);
|
2016-11-01 20:02:47 +07:00
|
|
|
u32 dnode = tsk_peer_node(tsk);
|
|
|
|
struct sk_buff *skb;
|
|
|
|
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
/* Avoid that hi-prio shutdown msgs bypass msgs in link wakeup queue */
|
|
|
|
tipc_wait_for_cond(sock, &timeout, (!tsk->cong_link_cnt &&
|
|
|
|
!tsk_conn_cong(tsk)));
|
|
|
|
|
2019-11-28 10:10:08 +07:00
|
|
|
/* Push out delayed messages if in Nagle mode */
|
tipc: add test for Nagle algorithm effectiveness
When streaming in Nagle mode, we try to bundle small messages from user
as many as possible if there is one outstanding buffer, i.e. not ACK-ed
by the receiving side, which helps boost up the overall throughput. So,
the algorithm's effectiveness really depends on when Nagle ACK comes or
what the specific network latency (RTT) is, compared to the user's
message sending rate.
In a bad case, the user's sending rate is low or the network latency is
small, there will not be many bundles, so making a Nagle ACK or waiting
for it is not meaningful.
For example: a user sends its messages every 100ms and the RTT is 50ms,
then for each messages, we require one Nagle ACK but then there is only
one user message sent without any bundles.
In a better case, even if we have a few bundles (e.g. the RTT = 300ms),
but now the user sends messages in medium size, then there will not be
any difference at all, that says 3 x 1000-byte data messages if bundled
will still result in 3 bundles with MTU = 1500.
When Nagle is ineffective, the delay in user message sending is clearly
wasted instead of sending directly.
Besides, adding Nagle ACKs will consume some processor load on both the
sending and receiving sides.
This commit adds a test on the effectiveness of the Nagle algorithm for
an individual connection in the network on which it actually runs.
Particularly, upon receipt of a Nagle ACK we will compare the number of
bundles in the backlog queue to the number of user messages which would
be sent directly without Nagle. If the ratio is good (e.g. >= 2), Nagle
mode will be kept for further message sending. Otherwise, we will leave
Nagle and put a 'penalty' on the connection, so it will have to spend
more 'one-way' messages before being able to re-enter Nagle.
In addition, the 'ack-required' bit is only set when really needed that
the number of Nagle ACKs will be reduced during Nagle mode.
Testing with benchmark showed that with the patch, there was not much
difference in throughput for small messages since the tool continuously
sends messages without a break, so Nagle would still take in effect.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 16:38:38 +07:00
|
|
|
tipc_sk_push_backlog(tsk, false);
|
2019-11-28 10:10:08 +07:00
|
|
|
/* Remove pending SYN */
|
|
|
|
__skb_queue_purge(&sk->sk_write_queue);
|
2018-09-29 01:23:22 +07:00
|
|
|
|
2020-01-08 09:18:15 +07:00
|
|
|
/* Remove partially received buffer if any */
|
|
|
|
skb = skb_peek(&sk->sk_receive_queue);
|
|
|
|
if (skb && TIPC_SKB_CB(skb)->bytes_read) {
|
|
|
|
__skb_unlink(skb, &sk->sk_receive_queue);
|
|
|
|
kfree_skb(skb);
|
2016-11-01 20:02:47 +07:00
|
|
|
}
|
2016-12-22 19:22:29 +07:00
|
|
|
|
2020-01-08 09:18:15 +07:00
|
|
|
/* Reject all unreceived messages if connectionless */
|
|
|
|
if (tipc_sk_type_connectionless(sk)) {
|
|
|
|
tsk_rej_rx_queue(sk, error);
|
2016-12-22 19:22:29 +07:00
|
|
|
return;
|
2020-01-08 09:18:15 +07:00
|
|
|
}
|
2016-12-22 19:22:29 +07:00
|
|
|
|
2020-01-08 09:18:15 +07:00
|
|
|
switch (sk->sk_state) {
|
|
|
|
case TIPC_CONNECTING:
|
|
|
|
case TIPC_ESTABLISHED:
|
|
|
|
tipc_set_sk_state(sk, TIPC_DISCONNECTING);
|
|
|
|
tipc_node_remove_conn(net, dnode, tsk->portid);
|
|
|
|
/* Send a FIN+/- to its peer */
|
|
|
|
skb = __skb_dequeue(&sk->sk_receive_queue);
|
|
|
|
if (skb) {
|
|
|
|
__skb_queue_purge(&sk->sk_receive_queue);
|
|
|
|
tipc_sk_respond(sk, skb, error);
|
|
|
|
break;
|
|
|
|
}
|
2016-11-01 20:02:47 +07:00
|
|
|
skb = tipc_msg_create(TIPC_CRITICAL_IMPORTANCE,
|
|
|
|
TIPC_CONN_MSG, SHORT_H_SIZE, 0, dnode,
|
|
|
|
tsk_own_node(tsk), tsk_peer_port(tsk),
|
|
|
|
tsk->portid, error);
|
|
|
|
if (skb)
|
|
|
|
tipc_node_xmit_skb(net, skb, dnode, tsk->portid);
|
2020-01-08 09:18:15 +07:00
|
|
|
break;
|
|
|
|
case TIPC_LISTEN:
|
|
|
|
/* Reject all SYN messages */
|
|
|
|
tsk_rej_rx_queue(sk, error);
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
__skb_queue_purge(&sk->sk_receive_queue);
|
|
|
|
break;
|
2016-11-01 20:02:47 +07:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2006-01-03 01:04:38 +07:00
|
|
|
/**
|
2014-02-18 15:06:46 +07:00
|
|
|
* tipc_release - destroy a TIPC socket
|
2006-01-03 01:04:38 +07:00
|
|
|
* @sock: socket to destroy
|
|
|
|
*
|
|
|
|
* This routine cleans up any messages that are still queued on the socket.
|
|
|
|
* For DGRAM and RDM socket types, all queued messages are rejected.
|
|
|
|
* For SEQPACKET and STREAM socket types, the first message is rejected
|
|
|
|
* and any others are discarded. (If the first message on a STREAM socket
|
|
|
|
* is partially-read, it is discarded and the next one is rejected instead.)
|
2007-02-09 21:25:21 +07:00
|
|
|
*
|
2006-01-03 01:04:38 +07:00
|
|
|
* NOTE: Rejected messages are not necessarily returned to the sender! They
|
|
|
|
* are returned or discarded according to the "destination droppable" setting
|
|
|
|
* specified for the message by the sender.
|
|
|
|
*
|
|
|
|
* Returns 0 on success, errno otherwise
|
|
|
|
*/
|
2014-02-18 15:06:46 +07:00
|
|
|
static int tipc_release(struct socket *sock)
|
2006-01-03 01:04:38 +07:00
|
|
|
{
|
|
|
|
struct sock *sk = sock->sk;
|
2014-03-12 22:31:12 +07:00
|
|
|
struct tipc_sock *tsk;
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2008-04-15 14:22:02 +07:00
|
|
|
/*
|
|
|
|
* Exit if socket isn't fully initialized (occurs when a failed accept()
|
|
|
|
* releases a pre-allocated child socket that was never used)
|
|
|
|
*/
|
|
|
|
if (sk == NULL)
|
2006-01-03 01:04:38 +07:00
|
|
|
return 0;
|
2007-02-09 21:25:21 +07:00
|
|
|
|
2014-03-12 22:31:12 +07:00
|
|
|
tsk = tipc_sk(sk);
|
2008-04-15 14:22:02 +07:00
|
|
|
lock_sock(sk);
|
|
|
|
|
tipc: add trace_events for tipc socket
The commit adds the new trace_events for TIPC socket object:
trace_tipc_sk_create()
trace_tipc_sk_poll()
trace_tipc_sk_sendmsg()
trace_tipc_sk_sendmcast()
trace_tipc_sk_sendstream()
trace_tipc_sk_filter_rcv()
trace_tipc_sk_advance_rx()
trace_tipc_sk_rej_msg()
trace_tipc_sk_drop_msg()
trace_tipc_sk_release()
trace_tipc_sk_shutdown()
trace_tipc_sk_overlimit1()
trace_tipc_sk_overlimit2()
Also, enables the traces for the following cases:
- When user creates a TIPC socket;
- When user calls poll() on TIPC socket;
- When user sends a dgram/mcast/stream message.
- When a message is put into the socket 'sk_receive_queue';
- When a message is released from the socket 'sk_receive_queue';
- When a message is rejected (e.g. due to no port, invalid, etc.);
- When a message is dropped (e.g. due to wrong message type);
- When socket is released;
- When socket is shutdown;
- When socket rcvq's allocation is overlimit (> 90%);
- When socket rcvq + bklq's allocation is overlimit (> 90%);
- When the 'TIPC_ERR_OVERLOAD/2' issue happens;
Note:
a) All the socket traces are designed to be able to trace on a specific
socket by either using the 'event filtering' feature on a known socket
'portid' value or the sysctl file:
/proc/sys/net/tipc/sk_filter
The file determines a 'tuple' for what socket should be traced:
(portid, sock type, name type, name lower, name upper)
where:
+ 'portid' is the socket portid generated at socket creating, can be
found in the trace outputs or the 'tipc socket list' command printouts;
+ 'sock type' is the socket type (1 = SOCK_TREAM, ...);
+ 'name type', 'name lower' and 'name upper' are the service name being
connected to or published by the socket.
Value '0' means 'ANY', the default tuple value is (0, 0, 0, 0, 0) i.e.
the traces happen for every sockets with no filter.
b) The 'tipc_sk_overlimit1/2' event is also a conditional trace_event
which happens when the socket receive queue (and backlog queue) is
about to be overloaded, when the queue allocation is > 90%. Then, when
the trace is enabled, the last skbs leading to the TIPC_ERR_OVERLOAD/2
issue can be traced.
The trace event is designed as an 'upper watermark' notification that
the other traces (e.g. 'tipc_sk_advance_rx' vs 'tipc_sk_filter_rcv') or
actions can be triggerred in the meanwhile to see what is going on with
the socket queue.
In addition, the 'trace_tipc_sk_dump()' is also placed at the
'TIPC_ERR_OVERLOAD/2' case, so the socket and last skb can be dumped
for post-analysis.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 09:17:58 +07:00
|
|
|
trace_tipc_sk_release(sk, NULL, TIPC_DUMP_ALL, " ");
|
2016-11-01 20:02:47 +07:00
|
|
|
__tipc_shutdown(sock, TIPC_ERR_NO_PORT);
|
|
|
|
sk->sk_shutdown = SHUTDOWN_MASK;
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
tipc_sk_leave(tsk);
|
2014-08-23 05:09:20 +07:00
|
|
|
tipc_sk_withdraw(tsk, 0, NULL);
|
tipc: smooth change between replicast and broadcast
Currently, a multicast stream may start out using replicast, because
there are few destinations, and then it should ideally switch to
L2/broadcast IGMP/multicast when the number of destinations grows beyond
a certain limit. The opposite should happen when the number decreases
below the limit.
To eliminate the risk of message reordering caused by method change,
a sending socket must stick to a previously selected method until it
enters an idle period of 5 seconds. Means there is a 5 seconds pause
in the traffic from the sender socket.
If the sender never makes such a pause, the method will never change,
and transmission may become very inefficient as the cluster grows.
With this commit, we allow such a switch between replicast and
broadcast without any need for a traffic pause.
Solution is to send a dummy message with only the header, also with
the SYN bit set, via broadcast or replicast. For the data message,
the SYN bit is set and sending via replicast or broadcast (inverse
method with dummy).
Then, at receiving side any messages follow first SYN bit message
(data or dummy message), they will be held in deferred queue until
another pair (dummy or data message) arrived in other link.
v2: reverse christmas tree declaration
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-19 18:49:50 +07:00
|
|
|
__skb_queue_purge(&tsk->mc_method.deferredq);
|
2015-05-28 12:19:22 +07:00
|
|
|
sk_stop_timer(sk, &sk->sk_timer);
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
tipc_sk_remove(tsk);
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2018-09-04 09:12:41 +07:00
|
|
|
sock_orphan(sk);
|
2008-04-15 14:22:02 +07:00
|
|
|
/* Reject any messages that accumulated in backlog queue */
|
|
|
|
release_sock(sk);
|
2017-10-13 16:04:22 +07:00
|
|
|
tipc_dest_list_purge(&tsk->cong_links);
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
tsk->cong_link_cnt = 0;
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
call_rcu(&tsk->rcu, tipc_sk_callback);
|
2008-04-15 14:22:02 +07:00
|
|
|
sock->sk = NULL;
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2014-04-06 20:56:14 +07:00
|
|
|
return 0;
|
2006-01-03 01:04:38 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
2014-02-18 15:06:46 +07:00
|
|
|
* tipc_bind - associate or disassocate TIPC name(s) with a socket
|
2006-01-03 01:04:38 +07:00
|
|
|
* @sock: socket structure
|
|
|
|
* @uaddr: socket address describing name(s) and desired operation
|
|
|
|
* @uaddr_len: size of socket address data structure
|
2007-02-09 21:25:21 +07:00
|
|
|
*
|
2006-01-03 01:04:38 +07:00
|
|
|
* Name and name sequence binding is indicated using a positive scope value;
|
|
|
|
* a negative scope value unbinds the specified name. Specifying no name
|
|
|
|
* (i.e. a socket address length of 0) unbinds all names from the socket.
|
2007-02-09 21:25:21 +07:00
|
|
|
*
|
2006-01-03 01:04:38 +07:00
|
|
|
* Returns 0 on success, errno otherwise
|
2008-04-15 14:22:02 +07:00
|
|
|
*
|
|
|
|
* NOTE: This routine doesn't need to take the socket lock since it doesn't
|
|
|
|
* access any non-constant socket information.
|
2006-01-03 01:04:38 +07:00
|
|
|
*/
|
2014-02-18 15:06:46 +07:00
|
|
|
static int tipc_bind(struct socket *sock, struct sockaddr *uaddr,
|
|
|
|
int uaddr_len)
|
2006-01-03 01:04:38 +07:00
|
|
|
{
|
2013-12-27 09:18:28 +07:00
|
|
|
struct sock *sk = sock->sk;
|
2006-01-03 01:04:38 +07:00
|
|
|
struct sockaddr_tipc *addr = (struct sockaddr_tipc *)uaddr;
|
2014-03-12 22:31:12 +07:00
|
|
|
struct tipc_sock *tsk = tipc_sk(sk);
|
2013-12-27 09:18:28 +07:00
|
|
|
int res = -EINVAL;
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2013-12-27 09:18:28 +07:00
|
|
|
lock_sock(sk);
|
|
|
|
if (unlikely(!uaddr_len)) {
|
2014-08-23 05:09:20 +07:00
|
|
|
res = tipc_sk_withdraw(tsk, 0, NULL);
|
2013-12-27 09:18:28 +07:00
|
|
|
goto exit;
|
|
|
|
}
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
if (tsk->group) {
|
|
|
|
res = -EACCES;
|
|
|
|
goto exit;
|
|
|
|
}
|
2013-12-27 09:18:28 +07:00
|
|
|
if (uaddr_len < sizeof(struct sockaddr_tipc)) {
|
|
|
|
res = -EINVAL;
|
|
|
|
goto exit;
|
|
|
|
}
|
|
|
|
if (addr->family != AF_TIPC) {
|
|
|
|
res = -EAFNOSUPPORT;
|
|
|
|
goto exit;
|
|
|
|
}
|
2006-01-03 01:04:38 +07:00
|
|
|
|
|
|
|
if (addr->addrtype == TIPC_ADDR_NAME)
|
|
|
|
addr->addr.nameseq.upper = addr->addr.nameseq.lower;
|
2013-12-27 09:18:28 +07:00
|
|
|
else if (addr->addrtype != TIPC_ADDR_NAMESEQ) {
|
|
|
|
res = -EAFNOSUPPORT;
|
|
|
|
goto exit;
|
|
|
|
}
|
2007-02-09 21:25:21 +07:00
|
|
|
|
tipc: convert topology server to use new server facility
As the new TIPC server infrastructure has been introduced, we can
now convert the TIPC topology server to it. We get two benefits
from doing this:
1) It simplifies the topology server locking policy. In the
original locking policy, we placed one spin lock pointer in the
tipc_subscriber structure to reuse the lock of the subscriber's
server port, controlling access to members of tipc_subscriber
instance. That is, we only used one lock to ensure both
tipc_port and tipc_subscriber members were safely accessed.
Now we introduce another spin lock for tipc_subscriber structure
only protecting themselves, to get a finer granularity locking
policy. Moreover, the change will allow us to make the topology
server code more readable and maintainable.
2) It fixes a bug where sent subscription events may be lost when
the topology port is congested. Using the new service, the
topology server now queues sent events into an outgoing buffer,
and then wakes up a sender process which has been blocked in
workqueue context. The process will keep picking events from the
buffer and send them to their respective subscribers, using the
kernel socket interface, until the buffer is empty. Even if the
socket is congested during transmission there is no risk that
events may be dropped, since the sender process may block when
needed.
Some minor reordering of initialization is done, since we now
have a scenario where the topology server must be started after
socket initialization has taken place, as the former depends
on the latter. And overall, we see a simplification of the
TIPC subscriber code in making this changeover.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-17 21:54:40 +07:00
|
|
|
if ((addr->addr.nameseq.type < TIPC_RESERVED_TYPES) &&
|
2013-06-17 21:54:41 +07:00
|
|
|
(addr->addr.nameseq.type != TIPC_TOP_SRV) &&
|
2013-12-27 09:18:28 +07:00
|
|
|
(addr->addr.nameseq.type != TIPC_CFG_SRV)) {
|
|
|
|
res = -EACCES;
|
|
|
|
goto exit;
|
|
|
|
}
|
2011-11-03 02:49:40 +07:00
|
|
|
|
2018-03-15 22:48:51 +07:00
|
|
|
res = (addr->scope >= 0) ?
|
2014-08-23 05:09:20 +07:00
|
|
|
tipc_sk_publish(tsk, addr->scope, &addr->addr.nameseq) :
|
|
|
|
tipc_sk_withdraw(tsk, -addr->scope, &addr->addr.nameseq);
|
2013-12-27 09:18:28 +07:00
|
|
|
exit:
|
|
|
|
release_sock(sk);
|
|
|
|
return res;
|
2006-01-03 01:04:38 +07:00
|
|
|
}
|
|
|
|
|
2007-02-09 21:25:21 +07:00
|
|
|
/**
|
2014-02-18 15:06:46 +07:00
|
|
|
* tipc_getname - get port ID of socket or peer socket
|
2006-01-03 01:04:38 +07:00
|
|
|
* @sock: socket structure
|
|
|
|
* @uaddr: area for returned socket address
|
2008-07-15 12:43:32 +07:00
|
|
|
* @peer: 0 = own ID, 1 = current peer ID, 2 = current/former peer ID
|
2007-02-09 21:25:21 +07:00
|
|
|
*
|
2006-01-03 01:04:38 +07:00
|
|
|
* Returns 0 on success, errno otherwise
|
2008-04-15 14:22:02 +07:00
|
|
|
*
|
2008-07-15 12:43:32 +07:00
|
|
|
* NOTE: This routine doesn't need to take the socket lock since it only
|
|
|
|
* accesses socket information that is unchanging (or which changes in
|
2011-01-01 01:59:32 +07:00
|
|
|
* a completely predictable manner).
|
2006-01-03 01:04:38 +07:00
|
|
|
*/
|
2014-02-18 15:06:46 +07:00
|
|
|
static int tipc_getname(struct socket *sock, struct sockaddr *uaddr,
|
2018-02-13 02:00:20 +07:00
|
|
|
int peer)
|
2006-01-03 01:04:38 +07:00
|
|
|
{
|
|
|
|
struct sockaddr_tipc *addr = (struct sockaddr_tipc *)uaddr;
|
2016-11-01 20:02:46 +07:00
|
|
|
struct sock *sk = sock->sk;
|
|
|
|
struct tipc_sock *tsk = tipc_sk(sk);
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2010-10-31 14:10:32 +07:00
|
|
|
memset(addr, 0, sizeof(*addr));
|
2008-04-15 14:22:02 +07:00
|
|
|
if (peer) {
|
2016-11-01 20:02:49 +07:00
|
|
|
if ((!tipc_sk_connected(sk)) &&
|
2016-11-01 20:02:46 +07:00
|
|
|
((peer != 2) || (sk->sk_state != TIPC_DISCONNECTING)))
|
2008-07-15 12:43:32 +07:00
|
|
|
return -ENOTCONN;
|
2014-08-23 05:09:20 +07:00
|
|
|
addr->addr.id.ref = tsk_peer_port(tsk);
|
|
|
|
addr->addr.id.node = tsk_peer_node(tsk);
|
2008-04-15 14:22:02 +07:00
|
|
|
} else {
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
addr->addr.id.ref = tsk->portid;
|
2018-03-23 02:42:49 +07:00
|
|
|
addr->addr.id.node = tipc_own_addr(sock_net(sk));
|
2008-04-15 14:22:02 +07:00
|
|
|
}
|
2006-01-03 01:04:38 +07:00
|
|
|
|
|
|
|
addr->addrtype = TIPC_ADDR_ID;
|
|
|
|
addr->family = AF_TIPC;
|
|
|
|
addr->scope = 0;
|
|
|
|
addr->addr.name.domain = 0;
|
|
|
|
|
2018-02-13 02:00:20 +07:00
|
|
|
return sizeof(*addr);
|
2006-01-03 01:04:38 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
2018-06-28 23:43:44 +07:00
|
|
|
* tipc_poll - read and possibly block on pollmask
|
2006-01-03 01:04:38 +07:00
|
|
|
* @file: file structure associated with the socket
|
|
|
|
* @sock: socket for which to calculate the poll bits
|
2018-06-28 23:43:44 +07:00
|
|
|
* @wait: ???
|
2006-01-03 01:04:38 +07:00
|
|
|
*
|
2008-03-27 06:48:21 +07:00
|
|
|
* Returns pollmask value
|
|
|
|
*
|
|
|
|
* COMMENTARY:
|
|
|
|
* It appears that the usual socket locking mechanisms are not useful here
|
|
|
|
* since the pollmask info is potentially out-of-date the moment this routine
|
|
|
|
* exits. TCP and other protocols seem to rely on higher level poll routines
|
|
|
|
* to handle any preventable race conditions, so TIPC will do the same ...
|
|
|
|
*
|
2010-08-17 18:00:06 +07:00
|
|
|
* IMPORTANT: The fact that a read or write operation is indicated does NOT
|
|
|
|
* imply that the operation will succeed, merely that it should be performed
|
|
|
|
* and will not block.
|
2006-01-03 01:04:38 +07:00
|
|
|
*/
|
2018-06-28 23:43:44 +07:00
|
|
|
static __poll_t tipc_poll(struct file *file, struct socket *sock,
|
|
|
|
poll_table *wait)
|
2006-01-03 01:04:38 +07:00
|
|
|
{
|
2008-03-27 06:48:21 +07:00
|
|
|
struct sock *sk = sock->sk;
|
2014-03-12 22:31:12 +07:00
|
|
|
struct tipc_sock *tsk = tipc_sk(sk);
|
2017-07-03 11:01:49 +07:00
|
|
|
__poll_t revents = 0;
|
2008-03-27 06:48:21 +07:00
|
|
|
|
2018-10-23 18:40:39 +07:00
|
|
|
sock_poll_wait(file, sock, wait);
|
tipc: add trace_events for tipc socket
The commit adds the new trace_events for TIPC socket object:
trace_tipc_sk_create()
trace_tipc_sk_poll()
trace_tipc_sk_sendmsg()
trace_tipc_sk_sendmcast()
trace_tipc_sk_sendstream()
trace_tipc_sk_filter_rcv()
trace_tipc_sk_advance_rx()
trace_tipc_sk_rej_msg()
trace_tipc_sk_drop_msg()
trace_tipc_sk_release()
trace_tipc_sk_shutdown()
trace_tipc_sk_overlimit1()
trace_tipc_sk_overlimit2()
Also, enables the traces for the following cases:
- When user creates a TIPC socket;
- When user calls poll() on TIPC socket;
- When user sends a dgram/mcast/stream message.
- When a message is put into the socket 'sk_receive_queue';
- When a message is released from the socket 'sk_receive_queue';
- When a message is rejected (e.g. due to no port, invalid, etc.);
- When a message is dropped (e.g. due to wrong message type);
- When socket is released;
- When socket is shutdown;
- When socket rcvq's allocation is overlimit (> 90%);
- When socket rcvq + bklq's allocation is overlimit (> 90%);
- When the 'TIPC_ERR_OVERLOAD/2' issue happens;
Note:
a) All the socket traces are designed to be able to trace on a specific
socket by either using the 'event filtering' feature on a known socket
'portid' value or the sysctl file:
/proc/sys/net/tipc/sk_filter
The file determines a 'tuple' for what socket should be traced:
(portid, sock type, name type, name lower, name upper)
where:
+ 'portid' is the socket portid generated at socket creating, can be
found in the trace outputs or the 'tipc socket list' command printouts;
+ 'sock type' is the socket type (1 = SOCK_TREAM, ...);
+ 'name type', 'name lower' and 'name upper' are the service name being
connected to or published by the socket.
Value '0' means 'ANY', the default tuple value is (0, 0, 0, 0, 0) i.e.
the traces happen for every sockets with no filter.
b) The 'tipc_sk_overlimit1/2' event is also a conditional trace_event
which happens when the socket receive queue (and backlog queue) is
about to be overloaded, when the queue allocation is > 90%. Then, when
the trace is enabled, the last skbs leading to the TIPC_ERR_OVERLOAD/2
issue can be traced.
The trace event is designed as an 'upper watermark' notification that
the other traces (e.g. 'tipc_sk_advance_rx' vs 'tipc_sk_filter_rcv') or
actions can be triggerred in the meanwhile to see what is going on with
the socket queue.
In addition, the 'trace_tipc_sk_dump()' is also placed at the
'TIPC_ERR_OVERLOAD/2' case, so the socket and last skb can be dumped
for post-analysis.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 09:17:58 +07:00
|
|
|
trace_tipc_sk_poll(sk, NULL, TIPC_DUMP_ALL, " ");
|
2018-06-28 23:43:44 +07:00
|
|
|
|
2016-11-01 20:02:47 +07:00
|
|
|
if (sk->sk_shutdown & RCV_SHUTDOWN)
|
2018-02-12 05:34:03 +07:00
|
|
|
revents |= EPOLLRDHUP | EPOLLIN | EPOLLRDNORM;
|
2016-11-01 20:02:47 +07:00
|
|
|
if (sk->sk_shutdown == SHUTDOWN_MASK)
|
2018-02-12 05:34:03 +07:00
|
|
|
revents |= EPOLLHUP;
|
2016-11-01 20:02:47 +07:00
|
|
|
|
2016-11-01 20:02:49 +07:00
|
|
|
switch (sk->sk_state) {
|
|
|
|
case TIPC_ESTABLISHED:
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
if (!tsk->cong_link_cnt && !tsk_conn_cong(tsk))
|
2018-02-12 05:34:03 +07:00
|
|
|
revents |= EPOLLOUT;
|
2020-08-24 05:36:59 +07:00
|
|
|
fallthrough;
|
2016-11-01 20:02:49 +07:00
|
|
|
case TIPC_LISTEN:
|
2019-05-09 12:13:42 +07:00
|
|
|
case TIPC_CONNECTING:
|
2019-10-24 12:44:50 +07:00
|
|
|
if (!skb_queue_empty_lockless(&sk->sk_receive_queue))
|
2018-02-12 05:34:03 +07:00
|
|
|
revents |= EPOLLIN | EPOLLRDNORM;
|
2016-11-01 20:02:49 +07:00
|
|
|
break;
|
|
|
|
case TIPC_OPEN:
|
2018-01-17 22:42:46 +07:00
|
|
|
if (tsk->group_is_open && !tsk->cong_link_cnt)
|
2018-02-12 05:34:03 +07:00
|
|
|
revents |= EPOLLOUT;
|
2017-10-13 16:04:25 +07:00
|
|
|
if (!tipc_sk_type_connectionless(sk))
|
|
|
|
break;
|
2019-10-24 12:44:50 +07:00
|
|
|
if (skb_queue_empty_lockless(&sk->sk_receive_queue))
|
2017-10-13 16:04:25 +07:00
|
|
|
break;
|
2018-02-12 05:34:03 +07:00
|
|
|
revents |= EPOLLIN | EPOLLRDNORM;
|
2016-11-01 20:02:49 +07:00
|
|
|
break;
|
|
|
|
case TIPC_DISCONNECTING:
|
2018-02-12 05:34:03 +07:00
|
|
|
revents = EPOLLIN | EPOLLRDNORM | EPOLLHUP;
|
2016-11-01 20:02:49 +07:00
|
|
|
break;
|
2010-08-17 18:00:06 +07:00
|
|
|
}
|
2017-10-13 16:04:25 +07:00
|
|
|
return revents;
|
2006-01-03 01:04:38 +07:00
|
|
|
}
|
|
|
|
|
2014-07-17 07:41:01 +07:00
|
|
|
/**
|
|
|
|
* tipc_sendmcast - send multicast message
|
|
|
|
* @sock: socket structure
|
|
|
|
* @seq: destination address
|
2014-11-15 13:13:43 +07:00
|
|
|
* @msg: message to send
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
* @dlen: length of data to send
|
|
|
|
* @timeout: timeout to wait for wakeup
|
2014-07-17 07:41:01 +07:00
|
|
|
*
|
|
|
|
* Called from function tipc_sendmsg(), which has done all sanity checks
|
|
|
|
* Returns the number of bytes sent on success, or errno
|
|
|
|
*/
|
|
|
|
static int tipc_sendmcast(struct socket *sock, struct tipc_name_seq *seq,
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
struct msghdr *msg, size_t dlen, long timeout)
|
2014-07-17 07:41:01 +07:00
|
|
|
{
|
|
|
|
struct sock *sk = sock->sk;
|
2015-02-05 20:36:36 +07:00
|
|
|
struct tipc_sock *tsk = tipc_sk(sk);
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
struct tipc_msg *hdr = &tsk->phdr;
|
2015-01-09 14:27:05 +07:00
|
|
|
struct net *net = sock_net(sk);
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
int mtu = tipc_bcast_get_mtu(net);
|
2017-01-19 01:50:53 +07:00
|
|
|
struct tipc_mc_method *method = &tsk->mc_method;
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
struct sk_buff_head pkts;
|
2017-01-19 01:50:52 +07:00
|
|
|
struct tipc_nlist dsts;
|
2014-07-17 07:41:01 +07:00
|
|
|
int rc;
|
|
|
|
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
if (tsk->group)
|
|
|
|
return -EACCES;
|
|
|
|
|
2017-01-19 01:50:52 +07:00
|
|
|
/* Block or return if any destination link is congested */
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
rc = tipc_wait_for_cond(sock, &timeout, !tsk->cong_link_cnt);
|
|
|
|
if (unlikely(rc))
|
|
|
|
return rc;
|
2016-03-01 17:07:09 +07:00
|
|
|
|
2017-01-19 01:50:52 +07:00
|
|
|
/* Lookup destination nodes */
|
|
|
|
tipc_nlist_init(&dsts, tipc_own_addr(net));
|
|
|
|
tipc_nametbl_lookup_dst_nodes(net, seq->type, seq->lower,
|
2018-01-13 02:56:50 +07:00
|
|
|
seq->upper, &dsts);
|
2017-01-19 01:50:52 +07:00
|
|
|
if (!dsts.local && !dsts.remote)
|
|
|
|
return -EHOSTUNREACH;
|
|
|
|
|
|
|
|
/* Build message header */
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
msg_set_type(hdr, TIPC_MCAST_MSG);
|
2017-01-19 01:50:52 +07:00
|
|
|
msg_set_hdr_sz(hdr, MCAST_H_SIZE);
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
msg_set_lookup_scope(hdr, TIPC_CLUSTER_SCOPE);
|
|
|
|
msg_set_destport(hdr, 0);
|
|
|
|
msg_set_destnode(hdr, 0);
|
|
|
|
msg_set_nametype(hdr, seq->type);
|
|
|
|
msg_set_namelower(hdr, seq->lower);
|
|
|
|
msg_set_nameupper(hdr, seq->upper);
|
|
|
|
|
2017-01-19 01:50:52 +07:00
|
|
|
/* Build message as chain of buffers */
|
2019-08-15 21:42:50 +07:00
|
|
|
__skb_queue_head_init(&pkts);
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
rc = tipc_msg_build(hdr, msg, 0, dlen, mtu, &pkts);
|
2014-07-17 07:41:01 +07:00
|
|
|
|
2017-01-19 01:50:52 +07:00
|
|
|
/* Send message if build was successful */
|
tipc: add trace_events for tipc socket
The commit adds the new trace_events for TIPC socket object:
trace_tipc_sk_create()
trace_tipc_sk_poll()
trace_tipc_sk_sendmsg()
trace_tipc_sk_sendmcast()
trace_tipc_sk_sendstream()
trace_tipc_sk_filter_rcv()
trace_tipc_sk_advance_rx()
trace_tipc_sk_rej_msg()
trace_tipc_sk_drop_msg()
trace_tipc_sk_release()
trace_tipc_sk_shutdown()
trace_tipc_sk_overlimit1()
trace_tipc_sk_overlimit2()
Also, enables the traces for the following cases:
- When user creates a TIPC socket;
- When user calls poll() on TIPC socket;
- When user sends a dgram/mcast/stream message.
- When a message is put into the socket 'sk_receive_queue';
- When a message is released from the socket 'sk_receive_queue';
- When a message is rejected (e.g. due to no port, invalid, etc.);
- When a message is dropped (e.g. due to wrong message type);
- When socket is released;
- When socket is shutdown;
- When socket rcvq's allocation is overlimit (> 90%);
- When socket rcvq + bklq's allocation is overlimit (> 90%);
- When the 'TIPC_ERR_OVERLOAD/2' issue happens;
Note:
a) All the socket traces are designed to be able to trace on a specific
socket by either using the 'event filtering' feature on a known socket
'portid' value or the sysctl file:
/proc/sys/net/tipc/sk_filter
The file determines a 'tuple' for what socket should be traced:
(portid, sock type, name type, name lower, name upper)
where:
+ 'portid' is the socket portid generated at socket creating, can be
found in the trace outputs or the 'tipc socket list' command printouts;
+ 'sock type' is the socket type (1 = SOCK_TREAM, ...);
+ 'name type', 'name lower' and 'name upper' are the service name being
connected to or published by the socket.
Value '0' means 'ANY', the default tuple value is (0, 0, 0, 0, 0) i.e.
the traces happen for every sockets with no filter.
b) The 'tipc_sk_overlimit1/2' event is also a conditional trace_event
which happens when the socket receive queue (and backlog queue) is
about to be overloaded, when the queue allocation is > 90%. Then, when
the trace is enabled, the last skbs leading to the TIPC_ERR_OVERLOAD/2
issue can be traced.
The trace event is designed as an 'upper watermark' notification that
the other traces (e.g. 'tipc_sk_advance_rx' vs 'tipc_sk_filter_rcv') or
actions can be triggerred in the meanwhile to see what is going on with
the socket queue.
In addition, the 'trace_tipc_sk_dump()' is also placed at the
'TIPC_ERR_OVERLOAD/2' case, so the socket and last skb can be dumped
for post-analysis.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 09:17:58 +07:00
|
|
|
if (unlikely(rc == dlen)) {
|
|
|
|
trace_tipc_sk_sendmcast(sk, skb_peek(&pkts),
|
|
|
|
TIPC_DUMP_SK_SNDQ, " ");
|
2017-01-19 01:50:53 +07:00
|
|
|
rc = tipc_mcast_xmit(net, &pkts, method, &dsts,
|
2017-01-19 01:50:52 +07:00
|
|
|
&tsk->cong_link_cnt);
|
tipc: add trace_events for tipc socket
The commit adds the new trace_events for TIPC socket object:
trace_tipc_sk_create()
trace_tipc_sk_poll()
trace_tipc_sk_sendmsg()
trace_tipc_sk_sendmcast()
trace_tipc_sk_sendstream()
trace_tipc_sk_filter_rcv()
trace_tipc_sk_advance_rx()
trace_tipc_sk_rej_msg()
trace_tipc_sk_drop_msg()
trace_tipc_sk_release()
trace_tipc_sk_shutdown()
trace_tipc_sk_overlimit1()
trace_tipc_sk_overlimit2()
Also, enables the traces for the following cases:
- When user creates a TIPC socket;
- When user calls poll() on TIPC socket;
- When user sends a dgram/mcast/stream message.
- When a message is put into the socket 'sk_receive_queue';
- When a message is released from the socket 'sk_receive_queue';
- When a message is rejected (e.g. due to no port, invalid, etc.);
- When a message is dropped (e.g. due to wrong message type);
- When socket is released;
- When socket is shutdown;
- When socket rcvq's allocation is overlimit (> 90%);
- When socket rcvq + bklq's allocation is overlimit (> 90%);
- When the 'TIPC_ERR_OVERLOAD/2' issue happens;
Note:
a) All the socket traces are designed to be able to trace on a specific
socket by either using the 'event filtering' feature on a known socket
'portid' value or the sysctl file:
/proc/sys/net/tipc/sk_filter
The file determines a 'tuple' for what socket should be traced:
(portid, sock type, name type, name lower, name upper)
where:
+ 'portid' is the socket portid generated at socket creating, can be
found in the trace outputs or the 'tipc socket list' command printouts;
+ 'sock type' is the socket type (1 = SOCK_TREAM, ...);
+ 'name type', 'name lower' and 'name upper' are the service name being
connected to or published by the socket.
Value '0' means 'ANY', the default tuple value is (0, 0, 0, 0, 0) i.e.
the traces happen for every sockets with no filter.
b) The 'tipc_sk_overlimit1/2' event is also a conditional trace_event
which happens when the socket receive queue (and backlog queue) is
about to be overloaded, when the queue allocation is > 90%. Then, when
the trace is enabled, the last skbs leading to the TIPC_ERR_OVERLOAD/2
issue can be traced.
The trace event is designed as an 'upper watermark' notification that
the other traces (e.g. 'tipc_sk_advance_rx' vs 'tipc_sk_filter_rcv') or
actions can be triggerred in the meanwhile to see what is going on with
the socket queue.
In addition, the 'trace_tipc_sk_dump()' is also placed at the
'TIPC_ERR_OVERLOAD/2' case, so the socket and last skb can be dumped
for post-analysis.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 09:17:58 +07:00
|
|
|
}
|
2017-01-19 01:50:52 +07:00
|
|
|
|
|
|
|
tipc_nlist_purge(&dsts);
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
|
|
|
|
return rc ? rc : dlen;
|
2014-07-17 07:41:01 +07:00
|
|
|
}
|
|
|
|
|
2017-10-13 16:04:27 +07:00
|
|
|
/**
|
|
|
|
* tipc_send_group_msg - send a message to a member in the group
|
|
|
|
* @net: network namespace
|
|
|
|
* @m: message to send
|
|
|
|
* @mb: group member
|
|
|
|
* @dnode: destination node
|
|
|
|
* @dport: destination port
|
|
|
|
* @dlen: total length of message data
|
|
|
|
*/
|
|
|
|
static int tipc_send_group_msg(struct net *net, struct tipc_sock *tsk,
|
|
|
|
struct msghdr *m, struct tipc_member *mb,
|
|
|
|
u32 dnode, u32 dport, int dlen)
|
|
|
|
{
|
2017-10-13 16:04:30 +07:00
|
|
|
u16 bc_snd_nxt = tipc_group_bc_snd_nxt(tsk->group);
|
tipc: guarantee that group broadcast doesn't bypass group unicast
We need a mechanism guaranteeing that group unicasts sent out from a
socket are not bypassed by later sent broadcasts from the same socket.
We do this as follows:
- Each time a unicast is sent, we set a the broadcast method for the
socket to "replicast" and "mandatory". This forces the first
subsequent broadcast message to follow the same network and data path
as the preceding unicast to a destination, hence preventing it from
overtaking the latter.
- In order to make the 'same data path' statement above true, we let
group unicasts pass through the multicast link input queue, instead
of as previously through the unicast link input queue.
- In the first broadcast following a unicast, we set a new header flag,
requiring all recipients to immediately acknowledge its reception.
- During the period before all the expected acknowledges are received,
the socket refuses to accept any more broadcast attempts, i.e., by
blocking or returning EAGAIN. This period should typically not be
longer than a few microseconds.
- When all acknowledges have been received, the sending socket will
open up for subsequent broadcasts, this time giving the link layer
freedom to itself select the best transmission method.
- The forced and/or abrupt transmission method changes described above
may lead to broadcasts arriving out of order to the recipients. We
remedy this by introducing code that checks and if necessary
re-orders such messages at the receiving end.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:31 +07:00
|
|
|
struct tipc_mc_method *method = &tsk->mc_method;
|
2017-10-13 16:04:27 +07:00
|
|
|
int blks = tsk_blocks(GROUP_H_SIZE + dlen);
|
|
|
|
struct tipc_msg *hdr = &tsk->phdr;
|
|
|
|
struct sk_buff_head pkts;
|
|
|
|
int mtu, rc;
|
|
|
|
|
|
|
|
/* Complete message header */
|
|
|
|
msg_set_type(hdr, TIPC_GRP_UCAST_MSG);
|
|
|
|
msg_set_hdr_sz(hdr, GROUP_H_SIZE);
|
|
|
|
msg_set_destport(hdr, dport);
|
|
|
|
msg_set_destnode(hdr, dnode);
|
2017-10-13 16:04:30 +07:00
|
|
|
msg_set_grp_bc_seqno(hdr, bc_snd_nxt);
|
2017-10-13 16:04:27 +07:00
|
|
|
|
|
|
|
/* Build message as chain of buffers */
|
2019-08-15 21:42:50 +07:00
|
|
|
__skb_queue_head_init(&pkts);
|
tipc: improve throughput between nodes in netns
Currently, TIPC transports intra-node user data messages directly
socket to socket, hence shortcutting all the lower layers of the
communication stack. This gives TIPC very good intra node performance,
both regarding throughput and latency.
We now introduce a similar mechanism for TIPC data traffic across
network namespaces located in the same kernel. On the send path, the
call chain is as always accompanied by the sending node's network name
space pointer. However, once we have reliably established that the
receiving node is represented by a namespace on the same host, we just
replace the namespace pointer with the receiving node/namespace's
ditto, and follow the regular socket receive patch though the receiving
node. This technique gives us a throughput similar to the node internal
throughput, several times larger than if we let the traffic go though
the full network stacks. As a comparison, max throughput for 64k
messages is four times larger than TCP throughput for the same type of
traffic.
To meet any security concerns, the following should be noted.
- All nodes joining a cluster are supposed to have been be certified
and authenticated by mechanisms outside TIPC. This is no different for
nodes/namespaces on the same host; they have to auto discover each
other using the attached interfaces, and establish links which are
supervised via the regular link monitoring mechanism. Hence, a kernel
local node has no other way to join a cluster than any other node, and
have to obey to policies set in the IP or device layers of the stack.
- Only when a sender has established with 100% certainty that the peer
node is located in a kernel local namespace does it choose to let user
data messages, and only those, take the crossover path to the receiving
node/namespace.
- If the receiving node/namespace is removed, its namespace pointer
is invalidated at all peer nodes, and their neighbor link monitoring
will eventually note that this node is gone.
- To ensure the "100% certainty" criteria, and prevent any possible
spoofing, received discovery messages must contain a proof that the
sender knows a common secret. We use the hash mix of the sending
node/namespace for this purpose, since it can be accessed directly by
all other namespaces in the kernel. Upon reception of a discovery
message, the receiver checks this proof against all the local
namespaces'hash_mix:es. If it finds a match, that, along with a
matching node id and cluster id, this is deemed sufficient proof that
the peer node in question is in a local namespace, and a wormhole can
be opened.
- We should also consider that TIPC is intended to be a cluster local
IPC mechanism (just like e.g. UNIX sockets) rather than a network
protocol, and hence we think it can justified to allow it to shortcut the
lower protocol layers.
Regarding traceability, we should notice that since commit 6c9081a3915d
("tipc: add loopback device tracking") it is possible to follow the node
internal packet flow by just activating tcpdump on the loopback
interface. This will be true even for this mechanism; by activating
tcpdump on the involved nodes' loopback interfaces their inter-name
space messaging can easily be tracked.
v2:
- update 'net' pointer when node left/rejoined
v3:
- grab read/write lock when using node ref obj
v4:
- clone traffics between netns to loopback
Suggested-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29 07:51:21 +07:00
|
|
|
mtu = tipc_node_get_mtu(net, dnode, tsk->portid, false);
|
2017-10-13 16:04:27 +07:00
|
|
|
rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts);
|
|
|
|
if (unlikely(rc != dlen))
|
|
|
|
return rc;
|
|
|
|
|
|
|
|
/* Send message */
|
|
|
|
rc = tipc_node_xmit(net, &pkts, dnode, tsk->portid);
|
|
|
|
if (unlikely(rc == -ELINKCONG)) {
|
|
|
|
tipc_dest_push(&tsk->cong_links, dnode, 0);
|
|
|
|
tsk->cong_link_cnt++;
|
|
|
|
}
|
|
|
|
|
tipc: guarantee that group broadcast doesn't bypass group unicast
We need a mechanism guaranteeing that group unicasts sent out from a
socket are not bypassed by later sent broadcasts from the same socket.
We do this as follows:
- Each time a unicast is sent, we set a the broadcast method for the
socket to "replicast" and "mandatory". This forces the first
subsequent broadcast message to follow the same network and data path
as the preceding unicast to a destination, hence preventing it from
overtaking the latter.
- In order to make the 'same data path' statement above true, we let
group unicasts pass through the multicast link input queue, instead
of as previously through the unicast link input queue.
- In the first broadcast following a unicast, we set a new header flag,
requiring all recipients to immediately acknowledge its reception.
- During the period before all the expected acknowledges are received,
the socket refuses to accept any more broadcast attempts, i.e., by
blocking or returning EAGAIN. This period should typically not be
longer than a few microseconds.
- When all acknowledges have been received, the sending socket will
open up for subsequent broadcasts, this time giving the link layer
freedom to itself select the best transmission method.
- The forced and/or abrupt transmission method changes described above
may lead to broadcasts arriving out of order to the recipients. We
remedy this by introducing code that checks and if necessary
re-orders such messages at the receiving end.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:31 +07:00
|
|
|
/* Update send window */
|
2017-10-13 16:04:27 +07:00
|
|
|
tipc_group_update_member(mb, blks);
|
|
|
|
|
tipc: guarantee that group broadcast doesn't bypass group unicast
We need a mechanism guaranteeing that group unicasts sent out from a
socket are not bypassed by later sent broadcasts from the same socket.
We do this as follows:
- Each time a unicast is sent, we set a the broadcast method for the
socket to "replicast" and "mandatory". This forces the first
subsequent broadcast message to follow the same network and data path
as the preceding unicast to a destination, hence preventing it from
overtaking the latter.
- In order to make the 'same data path' statement above true, we let
group unicasts pass through the multicast link input queue, instead
of as previously through the unicast link input queue.
- In the first broadcast following a unicast, we set a new header flag,
requiring all recipients to immediately acknowledge its reception.
- During the period before all the expected acknowledges are received,
the socket refuses to accept any more broadcast attempts, i.e., by
blocking or returning EAGAIN. This period should typically not be
longer than a few microseconds.
- When all acknowledges have been received, the sending socket will
open up for subsequent broadcasts, this time giving the link layer
freedom to itself select the best transmission method.
- The forced and/or abrupt transmission method changes described above
may lead to broadcasts arriving out of order to the recipients. We
remedy this by introducing code that checks and if necessary
re-orders such messages at the receiving end.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:31 +07:00
|
|
|
/* A broadcast sent within next EXPIRE period must follow same path */
|
|
|
|
method->rcast = true;
|
|
|
|
method->mandatory = true;
|
2017-10-13 16:04:27 +07:00
|
|
|
return dlen;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* tipc_send_group_unicast - send message to a member in the group
|
|
|
|
* @sock: socket structure
|
|
|
|
* @m: message to send
|
|
|
|
* @dlen: total length of message data
|
|
|
|
* @timeout: timeout to wait for wakeup
|
|
|
|
*
|
|
|
|
* Called from function tipc_sendmsg(), which has done all sanity checks
|
|
|
|
* Returns the number of bytes sent on success, or errno
|
|
|
|
*/
|
|
|
|
static int tipc_send_group_unicast(struct socket *sock, struct msghdr *m,
|
|
|
|
int dlen, long timeout)
|
|
|
|
{
|
|
|
|
struct sock *sk = sock->sk;
|
|
|
|
DECLARE_SOCKADDR(struct sockaddr_tipc *, dest, m->msg_name);
|
|
|
|
int blks = tsk_blocks(GROUP_H_SIZE + dlen);
|
|
|
|
struct tipc_sock *tsk = tipc_sk(sk);
|
|
|
|
struct net *net = sock_net(sk);
|
|
|
|
struct tipc_member *mb = NULL;
|
|
|
|
u32 node, port;
|
|
|
|
int rc;
|
|
|
|
|
|
|
|
node = dest->addr.id.node;
|
|
|
|
port = dest->addr.id.ref;
|
|
|
|
if (!port && !node)
|
|
|
|
return -EHOSTUNREACH;
|
|
|
|
|
|
|
|
/* Block or return if destination link or member is congested */
|
|
|
|
rc = tipc_wait_for_cond(sock, &timeout,
|
|
|
|
!tipc_dest_find(&tsk->cong_links, node, 0) &&
|
2018-12-12 12:43:51 +07:00
|
|
|
tsk->group &&
|
|
|
|
!tipc_group_cong(tsk->group, node, port, blks,
|
|
|
|
&mb));
|
2017-10-13 16:04:27 +07:00
|
|
|
if (unlikely(rc))
|
|
|
|
return rc;
|
|
|
|
|
|
|
|
if (unlikely(!mb))
|
|
|
|
return -EHOSTUNREACH;
|
|
|
|
|
|
|
|
rc = tipc_send_group_msg(net, tsk, m, mb, node, port, dlen);
|
|
|
|
|
|
|
|
return rc ? rc : dlen;
|
|
|
|
}
|
|
|
|
|
2017-10-13 16:04:28 +07:00
|
|
|
/**
|
|
|
|
* tipc_send_group_anycast - send message to any member with given identity
|
|
|
|
* @sock: socket structure
|
|
|
|
* @m: message to send
|
|
|
|
* @dlen: total length of message data
|
|
|
|
* @timeout: timeout to wait for wakeup
|
|
|
|
*
|
|
|
|
* Called from function tipc_sendmsg(), which has done all sanity checks
|
|
|
|
* Returns the number of bytes sent on success, or errno
|
|
|
|
*/
|
|
|
|
static int tipc_send_group_anycast(struct socket *sock, struct msghdr *m,
|
|
|
|
int dlen, long timeout)
|
|
|
|
{
|
|
|
|
DECLARE_SOCKADDR(struct sockaddr_tipc *, dest, m->msg_name);
|
|
|
|
struct sock *sk = sock->sk;
|
|
|
|
struct tipc_sock *tsk = tipc_sk(sk);
|
|
|
|
struct list_head *cong_links = &tsk->cong_links;
|
|
|
|
int blks = tsk_blocks(GROUP_H_SIZE + dlen);
|
2018-01-09 03:03:30 +07:00
|
|
|
struct tipc_msg *hdr = &tsk->phdr;
|
2017-10-13 16:04:28 +07:00
|
|
|
struct tipc_member *first = NULL;
|
|
|
|
struct tipc_member *mbr = NULL;
|
|
|
|
struct net *net = sock_net(sk);
|
|
|
|
u32 node, port, exclude;
|
|
|
|
struct list_head dsts;
|
2018-01-09 03:03:30 +07:00
|
|
|
u32 type, inst, scope;
|
2017-10-13 16:04:28 +07:00
|
|
|
int lookups = 0;
|
|
|
|
int dstcnt, rc;
|
|
|
|
bool cong;
|
|
|
|
|
|
|
|
INIT_LIST_HEAD(&dsts);
|
|
|
|
|
2018-01-09 03:03:30 +07:00
|
|
|
type = msg_nametype(hdr);
|
2017-10-13 16:04:28 +07:00
|
|
|
inst = dest->addr.name.name.instance;
|
2018-01-09 03:03:30 +07:00
|
|
|
scope = msg_lookup_scope(hdr);
|
2017-10-13 16:04:28 +07:00
|
|
|
|
|
|
|
while (++lookups < 4) {
|
2018-12-12 12:43:51 +07:00
|
|
|
exclude = tipc_group_exclude(tsk->group);
|
|
|
|
|
2017-10-13 16:04:28 +07:00
|
|
|
first = NULL;
|
|
|
|
|
|
|
|
/* Look for a non-congested destination member, if any */
|
|
|
|
while (1) {
|
2018-01-09 03:03:30 +07:00
|
|
|
if (!tipc_nametbl_lookup(net, type, inst, scope, &dsts,
|
2017-10-13 16:04:28 +07:00
|
|
|
&dstcnt, exclude, false))
|
|
|
|
return -EHOSTUNREACH;
|
|
|
|
tipc_dest_pop(&dsts, &node, &port);
|
2018-12-12 12:43:51 +07:00
|
|
|
cong = tipc_group_cong(tsk->group, node, port, blks,
|
|
|
|
&mbr);
|
2017-10-13 16:04:28 +07:00
|
|
|
if (!cong)
|
|
|
|
break;
|
|
|
|
if (mbr == first)
|
|
|
|
break;
|
|
|
|
if (!first)
|
|
|
|
first = mbr;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Start over if destination was not in member list */
|
|
|
|
if (unlikely(!mbr))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (likely(!cong && !tipc_dest_find(cong_links, node, 0)))
|
|
|
|
break;
|
|
|
|
|
|
|
|
/* Block or return if destination link or member is congested */
|
|
|
|
rc = tipc_wait_for_cond(sock, &timeout,
|
|
|
|
!tipc_dest_find(cong_links, node, 0) &&
|
2018-12-12 12:43:51 +07:00
|
|
|
tsk->group &&
|
|
|
|
!tipc_group_cong(tsk->group, node, port,
|
2017-10-13 16:04:28 +07:00
|
|
|
blks, &mbr));
|
|
|
|
if (unlikely(rc))
|
|
|
|
return rc;
|
|
|
|
|
|
|
|
/* Send, unless destination disappeared while waiting */
|
|
|
|
if (likely(mbr))
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (unlikely(lookups >= 4))
|
|
|
|
return -EHOSTUNREACH;
|
|
|
|
|
|
|
|
rc = tipc_send_group_msg(net, tsk, m, mbr, node, port, dlen);
|
|
|
|
|
|
|
|
return rc ? rc : dlen;
|
|
|
|
}
|
|
|
|
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
/**
|
|
|
|
* tipc_send_group_bcast - send message to all members in communication group
|
2020-07-13 06:15:14 +07:00
|
|
|
* @sock: socket structure
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
* @m: message to send
|
|
|
|
* @dlen: total length of message data
|
|
|
|
* @timeout: timeout to wait for wakeup
|
|
|
|
*
|
|
|
|
* Called from function tipc_sendmsg(), which has done all sanity checks
|
|
|
|
* Returns the number of bytes sent on success, or errno
|
|
|
|
*/
|
|
|
|
static int tipc_send_group_bcast(struct socket *sock, struct msghdr *m,
|
|
|
|
int dlen, long timeout)
|
|
|
|
{
|
2017-10-13 16:04:29 +07:00
|
|
|
DECLARE_SOCKADDR(struct sockaddr_tipc *, dest, m->msg_name);
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
struct sock *sk = sock->sk;
|
|
|
|
struct net *net = sock_net(sk);
|
|
|
|
struct tipc_sock *tsk = tipc_sk(sk);
|
2018-12-17 14:25:12 +07:00
|
|
|
struct tipc_nlist *dsts;
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
struct tipc_mc_method *method = &tsk->mc_method;
|
tipc: guarantee that group broadcast doesn't bypass group unicast
We need a mechanism guaranteeing that group unicasts sent out from a
socket are not bypassed by later sent broadcasts from the same socket.
We do this as follows:
- Each time a unicast is sent, we set a the broadcast method for the
socket to "replicast" and "mandatory". This forces the first
subsequent broadcast message to follow the same network and data path
as the preceding unicast to a destination, hence preventing it from
overtaking the latter.
- In order to make the 'same data path' statement above true, we let
group unicasts pass through the multicast link input queue, instead
of as previously through the unicast link input queue.
- In the first broadcast following a unicast, we set a new header flag,
requiring all recipients to immediately acknowledge its reception.
- During the period before all the expected acknowledges are received,
the socket refuses to accept any more broadcast attempts, i.e., by
blocking or returning EAGAIN. This period should typically not be
longer than a few microseconds.
- When all acknowledges have been received, the sending socket will
open up for subsequent broadcasts, this time giving the link layer
freedom to itself select the best transmission method.
- The forced and/or abrupt transmission method changes described above
may lead to broadcasts arriving out of order to the recipients. We
remedy this by introducing code that checks and if necessary
re-orders such messages at the receiving end.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:31 +07:00
|
|
|
bool ack = method->mandatory && method->rcast;
|
2017-10-13 16:04:26 +07:00
|
|
|
int blks = tsk_blocks(MCAST_H_SIZE + dlen);
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
struct tipc_msg *hdr = &tsk->phdr;
|
|
|
|
int mtu = tipc_bcast_get_mtu(net);
|
|
|
|
struct sk_buff_head pkts;
|
|
|
|
int rc = -EHOSTUNREACH;
|
|
|
|
|
2017-10-13 16:04:26 +07:00
|
|
|
/* Block or return if any destination link or member is congested */
|
2018-12-12 12:43:51 +07:00
|
|
|
rc = tipc_wait_for_cond(sock, &timeout,
|
|
|
|
!tsk->cong_link_cnt && tsk->group &&
|
|
|
|
!tipc_group_bc_cong(tsk->group, blks));
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
if (unlikely(rc))
|
|
|
|
return rc;
|
|
|
|
|
2018-12-17 14:25:12 +07:00
|
|
|
dsts = tipc_group_dests(tsk->group);
|
|
|
|
if (!dsts->local && !dsts->remote)
|
|
|
|
return -EHOSTUNREACH;
|
|
|
|
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
/* Complete message header */
|
2017-10-13 16:04:29 +07:00
|
|
|
if (dest) {
|
|
|
|
msg_set_type(hdr, TIPC_GRP_MCAST_MSG);
|
|
|
|
msg_set_nameinst(hdr, dest->addr.name.name.instance);
|
|
|
|
} else {
|
|
|
|
msg_set_type(hdr, TIPC_GRP_BCAST_MSG);
|
|
|
|
msg_set_nameinst(hdr, 0);
|
|
|
|
}
|
2017-10-13 16:04:26 +07:00
|
|
|
msg_set_hdr_sz(hdr, GROUP_H_SIZE);
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
msg_set_destport(hdr, 0);
|
|
|
|
msg_set_destnode(hdr, 0);
|
2018-12-12 12:43:51 +07:00
|
|
|
msg_set_grp_bc_seqno(hdr, tipc_group_bc_snd_nxt(tsk->group));
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
|
tipc: guarantee that group broadcast doesn't bypass group unicast
We need a mechanism guaranteeing that group unicasts sent out from a
socket are not bypassed by later sent broadcasts from the same socket.
We do this as follows:
- Each time a unicast is sent, we set a the broadcast method for the
socket to "replicast" and "mandatory". This forces the first
subsequent broadcast message to follow the same network and data path
as the preceding unicast to a destination, hence preventing it from
overtaking the latter.
- In order to make the 'same data path' statement above true, we let
group unicasts pass through the multicast link input queue, instead
of as previously through the unicast link input queue.
- In the first broadcast following a unicast, we set a new header flag,
requiring all recipients to immediately acknowledge its reception.
- During the period before all the expected acknowledges are received,
the socket refuses to accept any more broadcast attempts, i.e., by
blocking or returning EAGAIN. This period should typically not be
longer than a few microseconds.
- When all acknowledges have been received, the sending socket will
open up for subsequent broadcasts, this time giving the link layer
freedom to itself select the best transmission method.
- The forced and/or abrupt transmission method changes described above
may lead to broadcasts arriving out of order to the recipients. We
remedy this by introducing code that checks and if necessary
re-orders such messages at the receiving end.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:31 +07:00
|
|
|
/* Avoid getting stuck with repeated forced replicasts */
|
|
|
|
msg_set_grp_bc_ack_req(hdr, ack);
|
|
|
|
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
/* Build message as chain of buffers */
|
2019-08-15 21:42:50 +07:00
|
|
|
__skb_queue_head_init(&pkts);
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts);
|
|
|
|
if (unlikely(rc != dlen))
|
|
|
|
return rc;
|
|
|
|
|
|
|
|
/* Send message */
|
tipc: guarantee that group broadcast doesn't bypass group unicast
We need a mechanism guaranteeing that group unicasts sent out from a
socket are not bypassed by later sent broadcasts from the same socket.
We do this as follows:
- Each time a unicast is sent, we set a the broadcast method for the
socket to "replicast" and "mandatory". This forces the first
subsequent broadcast message to follow the same network and data path
as the preceding unicast to a destination, hence preventing it from
overtaking the latter.
- In order to make the 'same data path' statement above true, we let
group unicasts pass through the multicast link input queue, instead
of as previously through the unicast link input queue.
- In the first broadcast following a unicast, we set a new header flag,
requiring all recipients to immediately acknowledge its reception.
- During the period before all the expected acknowledges are received,
the socket refuses to accept any more broadcast attempts, i.e., by
blocking or returning EAGAIN. This period should typically not be
longer than a few microseconds.
- When all acknowledges have been received, the sending socket will
open up for subsequent broadcasts, this time giving the link layer
freedom to itself select the best transmission method.
- The forced and/or abrupt transmission method changes described above
may lead to broadcasts arriving out of order to the recipients. We
remedy this by introducing code that checks and if necessary
re-orders such messages at the receiving end.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:31 +07:00
|
|
|
rc = tipc_mcast_xmit(net, &pkts, method, dsts, &tsk->cong_link_cnt);
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
if (unlikely(rc))
|
|
|
|
return rc;
|
|
|
|
|
2017-10-13 16:04:26 +07:00
|
|
|
/* Update broadcast sequence number and send windows */
|
tipc: guarantee that group broadcast doesn't bypass group unicast
We need a mechanism guaranteeing that group unicasts sent out from a
socket are not bypassed by later sent broadcasts from the same socket.
We do this as follows:
- Each time a unicast is sent, we set a the broadcast method for the
socket to "replicast" and "mandatory". This forces the first
subsequent broadcast message to follow the same network and data path
as the preceding unicast to a destination, hence preventing it from
overtaking the latter.
- In order to make the 'same data path' statement above true, we let
group unicasts pass through the multicast link input queue, instead
of as previously through the unicast link input queue.
- In the first broadcast following a unicast, we set a new header flag,
requiring all recipients to immediately acknowledge its reception.
- During the period before all the expected acknowledges are received,
the socket refuses to accept any more broadcast attempts, i.e., by
blocking or returning EAGAIN. This period should typically not be
longer than a few microseconds.
- When all acknowledges have been received, the sending socket will
open up for subsequent broadcasts, this time giving the link layer
freedom to itself select the best transmission method.
- The forced and/or abrupt transmission method changes described above
may lead to broadcasts arriving out of order to the recipients. We
remedy this by introducing code that checks and if necessary
re-orders such messages at the receiving end.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:31 +07:00
|
|
|
tipc_group_update_bc_members(tsk->group, blks, ack);
|
|
|
|
|
|
|
|
/* Broadcast link is now free to choose method for next broadcast */
|
|
|
|
method->mandatory = false;
|
|
|
|
method->expires = jiffies;
|
|
|
|
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
return dlen;
|
|
|
|
}
|
|
|
|
|
2017-10-13 16:04:29 +07:00
|
|
|
/**
|
|
|
|
* tipc_send_group_mcast - send message to all members with given identity
|
|
|
|
* @sock: socket structure
|
|
|
|
* @m: message to send
|
|
|
|
* @dlen: total length of message data
|
|
|
|
* @timeout: timeout to wait for wakeup
|
|
|
|
*
|
|
|
|
* Called from function tipc_sendmsg(), which has done all sanity checks
|
|
|
|
* Returns the number of bytes sent on success, or errno
|
|
|
|
*/
|
|
|
|
static int tipc_send_group_mcast(struct socket *sock, struct msghdr *m,
|
|
|
|
int dlen, long timeout)
|
|
|
|
{
|
|
|
|
struct sock *sk = sock->sk;
|
|
|
|
DECLARE_SOCKADDR(struct sockaddr_tipc *, dest, m->msg_name);
|
|
|
|
struct tipc_sock *tsk = tipc_sk(sk);
|
|
|
|
struct tipc_group *grp = tsk->group;
|
2018-01-09 03:03:30 +07:00
|
|
|
struct tipc_msg *hdr = &tsk->phdr;
|
2017-10-13 16:04:29 +07:00
|
|
|
struct net *net = sock_net(sk);
|
2018-01-09 03:03:30 +07:00
|
|
|
u32 type, inst, scope, exclude;
|
2017-10-13 16:04:29 +07:00
|
|
|
struct list_head dsts;
|
2018-01-09 03:03:30 +07:00
|
|
|
u32 dstcnt;
|
2017-10-13 16:04:29 +07:00
|
|
|
|
|
|
|
INIT_LIST_HEAD(&dsts);
|
|
|
|
|
2018-01-09 03:03:30 +07:00
|
|
|
type = msg_nametype(hdr);
|
|
|
|
inst = dest->addr.name.name.instance;
|
|
|
|
scope = msg_lookup_scope(hdr);
|
2017-10-13 16:04:29 +07:00
|
|
|
exclude = tipc_group_exclude(grp);
|
2018-01-09 03:03:30 +07:00
|
|
|
|
|
|
|
if (!tipc_nametbl_lookup(net, type, inst, scope, &dsts,
|
|
|
|
&dstcnt, exclude, true))
|
2017-10-13 16:04:29 +07:00
|
|
|
return -EHOSTUNREACH;
|
|
|
|
|
|
|
|
if (dstcnt == 1) {
|
|
|
|
tipc_dest_pop(&dsts, &dest->addr.id.node, &dest->addr.id.ref);
|
|
|
|
return tipc_send_group_unicast(sock, m, dlen, timeout);
|
|
|
|
}
|
|
|
|
|
|
|
|
tipc_dest_list_purge(&dsts);
|
|
|
|
return tipc_send_group_bcast(sock, m, dlen, timeout);
|
|
|
|
}
|
|
|
|
|
2015-02-05 20:36:44 +07:00
|
|
|
/**
|
|
|
|
* tipc_sk_mcast_rcv - Deliver multicast messages to all destination sockets
|
|
|
|
* @arrvq: queue with arriving messages, to be cloned after destination lookup
|
|
|
|
* @inputq: queue with cloned messages, delivered to socket after dest lookup
|
|
|
|
*
|
|
|
|
* Multi-threaded: parallel calls with reference to same queues may occur
|
2014-07-17 07:41:00 +07:00
|
|
|
*/
|
2015-02-05 20:36:44 +07:00
|
|
|
void tipc_sk_mcast_rcv(struct net *net, struct sk_buff_head *arrvq,
|
|
|
|
struct sk_buff_head *inputq)
|
2014-07-17 07:41:00 +07:00
|
|
|
{
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
u32 self = tipc_own_addr(net);
|
2018-01-09 03:03:30 +07:00
|
|
|
u32 type, lower, upper, scope;
|
2015-02-05 20:36:44 +07:00
|
|
|
struct sk_buff *skb, *_skb;
|
2018-07-31 23:01:37 +07:00
|
|
|
u32 portid, onode;
|
2018-01-09 03:03:30 +07:00
|
|
|
struct sk_buff_head tmpq;
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
struct list_head dports;
|
2018-01-09 03:03:30 +07:00
|
|
|
struct tipc_msg *hdr;
|
|
|
|
int user, mtyp, hlen;
|
|
|
|
bool exact;
|
2015-02-05 20:36:43 +07:00
|
|
|
|
2015-02-05 20:36:44 +07:00
|
|
|
__skb_queue_head_init(&tmpq);
|
2017-01-03 22:55:10 +07:00
|
|
|
INIT_LIST_HEAD(&dports);
|
2014-07-17 07:41:00 +07:00
|
|
|
|
2015-02-05 20:36:44 +07:00
|
|
|
skb = tipc_skb_peek(arrvq, &inputq->lock);
|
|
|
|
for (; skb; skb = tipc_skb_peek(arrvq, &inputq->lock)) {
|
2018-01-09 03:03:30 +07:00
|
|
|
hdr = buf_msg(skb);
|
|
|
|
user = msg_user(hdr);
|
|
|
|
mtyp = msg_type(hdr);
|
|
|
|
hlen = skb_headroom(skb) + msg_hdr_sz(hdr);
|
|
|
|
onode = msg_orignode(hdr);
|
|
|
|
type = msg_nametype(hdr);
|
|
|
|
|
tipc: guarantee that group broadcast doesn't bypass group unicast
We need a mechanism guaranteeing that group unicasts sent out from a
socket are not bypassed by later sent broadcasts from the same socket.
We do this as follows:
- Each time a unicast is sent, we set a the broadcast method for the
socket to "replicast" and "mandatory". This forces the first
subsequent broadcast message to follow the same network and data path
as the preceding unicast to a destination, hence preventing it from
overtaking the latter.
- In order to make the 'same data path' statement above true, we let
group unicasts pass through the multicast link input queue, instead
of as previously through the unicast link input queue.
- In the first broadcast following a unicast, we set a new header flag,
requiring all recipients to immediately acknowledge its reception.
- During the period before all the expected acknowledges are received,
the socket refuses to accept any more broadcast attempts, i.e., by
blocking or returning EAGAIN. This period should typically not be
longer than a few microseconds.
- When all acknowledges have been received, the sending socket will
open up for subsequent broadcasts, this time giving the link layer
freedom to itself select the best transmission method.
- The forced and/or abrupt transmission method changes described above
may lead to broadcasts arriving out of order to the recipients. We
remedy this by introducing code that checks and if necessary
re-orders such messages at the receiving end.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:31 +07:00
|
|
|
if (mtyp == TIPC_GRP_UCAST_MSG || user == GROUP_PROTOCOL) {
|
|
|
|
spin_lock_bh(&inputq->lock);
|
|
|
|
if (skb_peek(arrvq) == skb) {
|
|
|
|
__skb_dequeue(arrvq);
|
|
|
|
__skb_queue_tail(inputq, skb);
|
|
|
|
}
|
2017-12-12 01:11:55 +07:00
|
|
|
kfree_skb(skb);
|
tipc: guarantee that group broadcast doesn't bypass group unicast
We need a mechanism guaranteeing that group unicasts sent out from a
socket are not bypassed by later sent broadcasts from the same socket.
We do this as follows:
- Each time a unicast is sent, we set a the broadcast method for the
socket to "replicast" and "mandatory". This forces the first
subsequent broadcast message to follow the same network and data path
as the preceding unicast to a destination, hence preventing it from
overtaking the latter.
- In order to make the 'same data path' statement above true, we let
group unicasts pass through the multicast link input queue, instead
of as previously through the unicast link input queue.
- In the first broadcast following a unicast, we set a new header flag,
requiring all recipients to immediately acknowledge its reception.
- During the period before all the expected acknowledges are received,
the socket refuses to accept any more broadcast attempts, i.e., by
blocking or returning EAGAIN. This period should typically not be
longer than a few microseconds.
- When all acknowledges have been received, the sending socket will
open up for subsequent broadcasts, this time giving the link layer
freedom to itself select the best transmission method.
- The forced and/or abrupt transmission method changes described above
may lead to broadcasts arriving out of order to the recipients. We
remedy this by introducing code that checks and if necessary
re-orders such messages at the receiving end.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:31 +07:00
|
|
|
spin_unlock_bh(&inputq->lock);
|
|
|
|
continue;
|
|
|
|
}
|
2018-01-09 03:03:30 +07:00
|
|
|
|
|
|
|
/* Group messages require exact scope match */
|
|
|
|
if (msg_in_group(hdr)) {
|
|
|
|
lower = 0;
|
|
|
|
upper = ~0;
|
|
|
|
scope = msg_lookup_scope(hdr);
|
|
|
|
exact = true;
|
|
|
|
} else {
|
|
|
|
/* TIPC_NODE_SCOPE means "any scope" in this context */
|
|
|
|
if (onode == self)
|
|
|
|
scope = TIPC_NODE_SCOPE;
|
|
|
|
else
|
|
|
|
scope = TIPC_CLUSTER_SCOPE;
|
|
|
|
exact = false;
|
|
|
|
lower = msg_namelower(hdr);
|
|
|
|
upper = msg_nameupper(hdr);
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
}
|
2018-01-09 03:03:30 +07:00
|
|
|
|
|
|
|
/* Create destination port list: */
|
|
|
|
tipc_nametbl_mc_lookup(net, type, lower, upper,
|
|
|
|
scope, exact, &dports);
|
|
|
|
|
|
|
|
/* Clone message per destination */
|
2017-10-13 16:04:22 +07:00
|
|
|
while (tipc_dest_pop(&dports, NULL, &portid)) {
|
2018-01-09 03:03:30 +07:00
|
|
|
_skb = __pskb_copy(skb, hlen, GFP_ATOMIC);
|
2015-02-05 20:36:44 +07:00
|
|
|
if (_skb) {
|
|
|
|
msg_set_destport(buf_msg(_skb), portid);
|
|
|
|
__skb_queue_tail(&tmpq, _skb);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
pr_warn("Failed to clone mcast rcv buffer\n");
|
2014-07-17 07:41:00 +07:00
|
|
|
}
|
2015-02-05 20:36:44 +07:00
|
|
|
/* Append to inputq if not already done by other thread */
|
|
|
|
spin_lock_bh(&inputq->lock);
|
|
|
|
if (skb_peek(arrvq) == skb) {
|
|
|
|
skb_queue_splice_tail_init(&tmpq, inputq);
|
|
|
|
kfree_skb(__skb_dequeue(arrvq));
|
|
|
|
}
|
|
|
|
spin_unlock_bh(&inputq->lock);
|
|
|
|
__skb_queue_purge(&tmpq);
|
|
|
|
kfree_skb(skb);
|
2014-07-17 07:41:00 +07:00
|
|
|
}
|
2015-02-05 20:36:44 +07:00
|
|
|
tipc_sk_rcv(net, inputq);
|
2014-07-17 07:41:00 +07:00
|
|
|
}
|
|
|
|
|
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 20:00:41 +07:00
|
|
|
/* tipc_sk_push_backlog(): send accumulated buffers in socket write queue
|
|
|
|
* when socket is in Nagle mode
|
|
|
|
*/
|
tipc: add test for Nagle algorithm effectiveness
When streaming in Nagle mode, we try to bundle small messages from user
as many as possible if there is one outstanding buffer, i.e. not ACK-ed
by the receiving side, which helps boost up the overall throughput. So,
the algorithm's effectiveness really depends on when Nagle ACK comes or
what the specific network latency (RTT) is, compared to the user's
message sending rate.
In a bad case, the user's sending rate is low or the network latency is
small, there will not be many bundles, so making a Nagle ACK or waiting
for it is not meaningful.
For example: a user sends its messages every 100ms and the RTT is 50ms,
then for each messages, we require one Nagle ACK but then there is only
one user message sent without any bundles.
In a better case, even if we have a few bundles (e.g. the RTT = 300ms),
but now the user sends messages in medium size, then there will not be
any difference at all, that says 3 x 1000-byte data messages if bundled
will still result in 3 bundles with MTU = 1500.
When Nagle is ineffective, the delay in user message sending is clearly
wasted instead of sending directly.
Besides, adding Nagle ACKs will consume some processor load on both the
sending and receiving sides.
This commit adds a test on the effectiveness of the Nagle algorithm for
an individual connection in the network on which it actually runs.
Particularly, upon receipt of a Nagle ACK we will compare the number of
bundles in the backlog queue to the number of user messages which would
be sent directly without Nagle. If the ratio is good (e.g. >= 2), Nagle
mode will be kept for further message sending. Otherwise, we will leave
Nagle and put a 'penalty' on the connection, so it will have to spend
more 'one-way' messages before being able to re-enter Nagle.
In addition, the 'ack-required' bit is only set when really needed that
the number of Nagle ACKs will be reduced during Nagle mode.
Testing with benchmark showed that with the patch, there was not much
difference in throughput for small messages since the tool continuously
sends messages without a break, so Nagle would still take in effect.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 16:38:38 +07:00
|
|
|
static void tipc_sk_push_backlog(struct tipc_sock *tsk, bool nagle_ack)
|
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 20:00:41 +07:00
|
|
|
{
|
|
|
|
struct sk_buff_head *txq = &tsk->sk.sk_write_queue;
|
tipc: add test for Nagle algorithm effectiveness
When streaming in Nagle mode, we try to bundle small messages from user
as many as possible if there is one outstanding buffer, i.e. not ACK-ed
by the receiving side, which helps boost up the overall throughput. So,
the algorithm's effectiveness really depends on when Nagle ACK comes or
what the specific network latency (RTT) is, compared to the user's
message sending rate.
In a bad case, the user's sending rate is low or the network latency is
small, there will not be many bundles, so making a Nagle ACK or waiting
for it is not meaningful.
For example: a user sends its messages every 100ms and the RTT is 50ms,
then for each messages, we require one Nagle ACK but then there is only
one user message sent without any bundles.
In a better case, even if we have a few bundles (e.g. the RTT = 300ms),
but now the user sends messages in medium size, then there will not be
any difference at all, that says 3 x 1000-byte data messages if bundled
will still result in 3 bundles with MTU = 1500.
When Nagle is ineffective, the delay in user message sending is clearly
wasted instead of sending directly.
Besides, adding Nagle ACKs will consume some processor load on both the
sending and receiving sides.
This commit adds a test on the effectiveness of the Nagle algorithm for
an individual connection in the network on which it actually runs.
Particularly, upon receipt of a Nagle ACK we will compare the number of
bundles in the backlog queue to the number of user messages which would
be sent directly without Nagle. If the ratio is good (e.g. >= 2), Nagle
mode will be kept for further message sending. Otherwise, we will leave
Nagle and put a 'penalty' on the connection, so it will have to spend
more 'one-way' messages before being able to re-enter Nagle.
In addition, the 'ack-required' bit is only set when really needed that
the number of Nagle ACKs will be reduced during Nagle mode.
Testing with benchmark showed that with the patch, there was not much
difference in throughput for small messages since the tool continuously
sends messages without a break, so Nagle would still take in effect.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 16:38:38 +07:00
|
|
|
struct sk_buff *skb = skb_peek_tail(txq);
|
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 20:00:41 +07:00
|
|
|
struct net *net = sock_net(&tsk->sk);
|
|
|
|
u32 dnode = tsk_peer_node(tsk);
|
|
|
|
int rc;
|
|
|
|
|
tipc: add test for Nagle algorithm effectiveness
When streaming in Nagle mode, we try to bundle small messages from user
as many as possible if there is one outstanding buffer, i.e. not ACK-ed
by the receiving side, which helps boost up the overall throughput. So,
the algorithm's effectiveness really depends on when Nagle ACK comes or
what the specific network latency (RTT) is, compared to the user's
message sending rate.
In a bad case, the user's sending rate is low or the network latency is
small, there will not be many bundles, so making a Nagle ACK or waiting
for it is not meaningful.
For example: a user sends its messages every 100ms and the RTT is 50ms,
then for each messages, we require one Nagle ACK but then there is only
one user message sent without any bundles.
In a better case, even if we have a few bundles (e.g. the RTT = 300ms),
but now the user sends messages in medium size, then there will not be
any difference at all, that says 3 x 1000-byte data messages if bundled
will still result in 3 bundles with MTU = 1500.
When Nagle is ineffective, the delay in user message sending is clearly
wasted instead of sending directly.
Besides, adding Nagle ACKs will consume some processor load on both the
sending and receiving sides.
This commit adds a test on the effectiveness of the Nagle algorithm for
an individual connection in the network on which it actually runs.
Particularly, upon receipt of a Nagle ACK we will compare the number of
bundles in the backlog queue to the number of user messages which would
be sent directly without Nagle. If the ratio is good (e.g. >= 2), Nagle
mode will be kept for further message sending. Otherwise, we will leave
Nagle and put a 'penalty' on the connection, so it will have to spend
more 'one-way' messages before being able to re-enter Nagle.
In addition, the 'ack-required' bit is only set when really needed that
the number of Nagle ACKs will be reduced during Nagle mode.
Testing with benchmark showed that with the patch, there was not much
difference in throughput for small messages since the tool continuously
sends messages without a break, so Nagle would still take in effect.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 16:38:38 +07:00
|
|
|
if (nagle_ack) {
|
|
|
|
tsk->pkt_cnt += skb_queue_len(txq);
|
|
|
|
if (!tsk->pkt_cnt || tsk->msg_acc / tsk->pkt_cnt < 2) {
|
|
|
|
tsk->oneway = 0;
|
|
|
|
if (tsk->nagle_start < NAGLE_START_MAX)
|
|
|
|
tsk->nagle_start *= 2;
|
|
|
|
tsk->expect_ack = false;
|
|
|
|
pr_debug("tsk %10u: bad nagle %u -> %u, next start %u!\n",
|
|
|
|
tsk->portid, tsk->msg_acc, tsk->pkt_cnt,
|
|
|
|
tsk->nagle_start);
|
|
|
|
} else {
|
|
|
|
tsk->nagle_start = NAGLE_START_INIT;
|
|
|
|
if (skb) {
|
|
|
|
msg_set_ack_required(buf_msg(skb));
|
|
|
|
tsk->expect_ack = true;
|
|
|
|
} else {
|
|
|
|
tsk->expect_ack = false;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
tsk->msg_acc = 0;
|
|
|
|
tsk->pkt_cnt = 0;
|
|
|
|
}
|
|
|
|
|
2019-11-28 10:10:08 +07:00
|
|
|
if (!skb || tsk->cong_link_cnt)
|
|
|
|
return;
|
|
|
|
|
|
|
|
/* Do not send SYN again after congestion */
|
|
|
|
if (msg_is_syn(buf_msg(skb)))
|
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 20:00:41 +07:00
|
|
|
return;
|
|
|
|
|
tipc: add test for Nagle algorithm effectiveness
When streaming in Nagle mode, we try to bundle small messages from user
as many as possible if there is one outstanding buffer, i.e. not ACK-ed
by the receiving side, which helps boost up the overall throughput. So,
the algorithm's effectiveness really depends on when Nagle ACK comes or
what the specific network latency (RTT) is, compared to the user's
message sending rate.
In a bad case, the user's sending rate is low or the network latency is
small, there will not be many bundles, so making a Nagle ACK or waiting
for it is not meaningful.
For example: a user sends its messages every 100ms and the RTT is 50ms,
then for each messages, we require one Nagle ACK but then there is only
one user message sent without any bundles.
In a better case, even if we have a few bundles (e.g. the RTT = 300ms),
but now the user sends messages in medium size, then there will not be
any difference at all, that says 3 x 1000-byte data messages if bundled
will still result in 3 bundles with MTU = 1500.
When Nagle is ineffective, the delay in user message sending is clearly
wasted instead of sending directly.
Besides, adding Nagle ACKs will consume some processor load on both the
sending and receiving sides.
This commit adds a test on the effectiveness of the Nagle algorithm for
an individual connection in the network on which it actually runs.
Particularly, upon receipt of a Nagle ACK we will compare the number of
bundles in the backlog queue to the number of user messages which would
be sent directly without Nagle. If the ratio is good (e.g. >= 2), Nagle
mode will be kept for further message sending. Otherwise, we will leave
Nagle and put a 'penalty' on the connection, so it will have to spend
more 'one-way' messages before being able to re-enter Nagle.
In addition, the 'ack-required' bit is only set when really needed that
the number of Nagle ACKs will be reduced during Nagle mode.
Testing with benchmark showed that with the patch, there was not much
difference in throughput for small messages since the tool continuously
sends messages without a break, so Nagle would still take in effect.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 16:38:38 +07:00
|
|
|
if (tsk->msg_acc)
|
|
|
|
tsk->pkt_cnt += skb_queue_len(txq);
|
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 20:00:41 +07:00
|
|
|
tsk->snt_unacked += tsk->snd_backlog;
|
|
|
|
tsk->snd_backlog = 0;
|
|
|
|
rc = tipc_node_xmit(net, txq, dnode, tsk->portid);
|
|
|
|
if (rc == -ELINKCONG)
|
|
|
|
tsk->cong_link_cnt = 1;
|
|
|
|
}
|
|
|
|
|
2014-06-26 08:41:41 +07:00
|
|
|
/**
|
2017-10-13 16:04:20 +07:00
|
|
|
* tipc_sk_conn_proto_rcv - receive a connection mng protocol message
|
2014-06-26 08:41:41 +07:00
|
|
|
* @tsk: receiving socket
|
tipc: introduce new tipc_sk_respond() function
Currently, we use the code sequence
if (msg_reverse())
tipc_link_xmit_skb()
at numerous locations in socket.c. The preparation of arguments
for these calls, as well as the sequence itself, makes the code
unecessarily complex.
In this commit, we introduce a new function, tipc_sk_respond(),
that performs this call combination. We also replace some, but not
yet all, of these explicit call sequences with calls to the new
function. Notably, we let the function tipc_sk_proto_rcv() use
the new function to directly send out PROBE_REPLY messages,
instead of deferring this to the calling tipc_sk_rcv() function,
as we do now.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-22 21:11:19 +07:00
|
|
|
* @skb: pointer to message buffer.
|
2014-06-26 08:41:41 +07:00
|
|
|
*/
|
2017-10-13 16:04:20 +07:00
|
|
|
static void tipc_sk_conn_proto_rcv(struct tipc_sock *tsk, struct sk_buff *skb,
|
2018-10-10 22:50:23 +07:00
|
|
|
struct sk_buff_head *inputq,
|
2017-10-13 16:04:20 +07:00
|
|
|
struct sk_buff_head *xmitq)
|
2014-06-26 08:41:41 +07:00
|
|
|
{
|
tipc: introduce new tipc_sk_respond() function
Currently, we use the code sequence
if (msg_reverse())
tipc_link_xmit_skb()
at numerous locations in socket.c. The preparation of arguments
for these calls, as well as the sequence itself, makes the code
unecessarily complex.
In this commit, we introduce a new function, tipc_sk_respond(),
that performs this call combination. We also replace some, but not
yet all, of these explicit call sequences with calls to the new
function. Notably, we let the function tipc_sk_proto_rcv() use
the new function to directly send out PROBE_REPLY messages,
instead of deferring this to the calling tipc_sk_rcv() function,
as we do now.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-22 21:11:19 +07:00
|
|
|
struct tipc_msg *hdr = buf_msg(skb);
|
2017-10-13 16:04:20 +07:00
|
|
|
u32 onode = tsk_own_node(tsk);
|
|
|
|
struct sock *sk = &tsk->sk;
|
tipc: introduce new tipc_sk_respond() function
Currently, we use the code sequence
if (msg_reverse())
tipc_link_xmit_skb()
at numerous locations in socket.c. The preparation of arguments
for these calls, as well as the sequence itself, makes the code
unecessarily complex.
In this commit, we introduce a new function, tipc_sk_respond(),
that performs this call combination. We also replace some, but not
yet all, of these explicit call sequences with calls to the new
function. Notably, we let the function tipc_sk_proto_rcv() use
the new function to directly send out PROBE_REPLY messages,
instead of deferring this to the calling tipc_sk_rcv() function,
as we do now.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-22 21:11:19 +07:00
|
|
|
int mtyp = msg_type(hdr);
|
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 20:00:41 +07:00
|
|
|
bool was_cong;
|
tipc: introduce new tipc_sk_respond() function
Currently, we use the code sequence
if (msg_reverse())
tipc_link_xmit_skb()
at numerous locations in socket.c. The preparation of arguments
for these calls, as well as the sequence itself, makes the code
unecessarily complex.
In this commit, we introduce a new function, tipc_sk_respond(),
that performs this call combination. We also replace some, but not
yet all, of these explicit call sequences with calls to the new
function. Notably, we let the function tipc_sk_proto_rcv() use
the new function to directly send out PROBE_REPLY messages,
instead of deferring this to the calling tipc_sk_rcv() function,
as we do now.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-22 21:11:19 +07:00
|
|
|
|
2014-06-26 08:41:41 +07:00
|
|
|
/* Ignore if connection cannot be validated: */
|
tipc: add trace_events for tipc socket
The commit adds the new trace_events for TIPC socket object:
trace_tipc_sk_create()
trace_tipc_sk_poll()
trace_tipc_sk_sendmsg()
trace_tipc_sk_sendmcast()
trace_tipc_sk_sendstream()
trace_tipc_sk_filter_rcv()
trace_tipc_sk_advance_rx()
trace_tipc_sk_rej_msg()
trace_tipc_sk_drop_msg()
trace_tipc_sk_release()
trace_tipc_sk_shutdown()
trace_tipc_sk_overlimit1()
trace_tipc_sk_overlimit2()
Also, enables the traces for the following cases:
- When user creates a TIPC socket;
- When user calls poll() on TIPC socket;
- When user sends a dgram/mcast/stream message.
- When a message is put into the socket 'sk_receive_queue';
- When a message is released from the socket 'sk_receive_queue';
- When a message is rejected (e.g. due to no port, invalid, etc.);
- When a message is dropped (e.g. due to wrong message type);
- When socket is released;
- When socket is shutdown;
- When socket rcvq's allocation is overlimit (> 90%);
- When socket rcvq + bklq's allocation is overlimit (> 90%);
- When the 'TIPC_ERR_OVERLOAD/2' issue happens;
Note:
a) All the socket traces are designed to be able to trace on a specific
socket by either using the 'event filtering' feature on a known socket
'portid' value or the sysctl file:
/proc/sys/net/tipc/sk_filter
The file determines a 'tuple' for what socket should be traced:
(portid, sock type, name type, name lower, name upper)
where:
+ 'portid' is the socket portid generated at socket creating, can be
found in the trace outputs or the 'tipc socket list' command printouts;
+ 'sock type' is the socket type (1 = SOCK_TREAM, ...);
+ 'name type', 'name lower' and 'name upper' are the service name being
connected to or published by the socket.
Value '0' means 'ANY', the default tuple value is (0, 0, 0, 0, 0) i.e.
the traces happen for every sockets with no filter.
b) The 'tipc_sk_overlimit1/2' event is also a conditional trace_event
which happens when the socket receive queue (and backlog queue) is
about to be overloaded, when the queue allocation is > 90%. Then, when
the trace is enabled, the last skbs leading to the TIPC_ERR_OVERLOAD/2
issue can be traced.
The trace event is designed as an 'upper watermark' notification that
the other traces (e.g. 'tipc_sk_advance_rx' vs 'tipc_sk_filter_rcv') or
actions can be triggerred in the meanwhile to see what is going on with
the socket queue.
In addition, the 'trace_tipc_sk_dump()' is also placed at the
'TIPC_ERR_OVERLOAD/2' case, so the socket and last skb can be dumped
for post-analysis.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 09:17:58 +07:00
|
|
|
if (!tsk_peer_msg(tsk, hdr)) {
|
|
|
|
trace_tipc_sk_drop_msg(sk, skb, TIPC_DUMP_NONE, "@proto_rcv!");
|
2014-06-26 08:41:41 +07:00
|
|
|
goto exit;
|
tipc: add trace_events for tipc socket
The commit adds the new trace_events for TIPC socket object:
trace_tipc_sk_create()
trace_tipc_sk_poll()
trace_tipc_sk_sendmsg()
trace_tipc_sk_sendmcast()
trace_tipc_sk_sendstream()
trace_tipc_sk_filter_rcv()
trace_tipc_sk_advance_rx()
trace_tipc_sk_rej_msg()
trace_tipc_sk_drop_msg()
trace_tipc_sk_release()
trace_tipc_sk_shutdown()
trace_tipc_sk_overlimit1()
trace_tipc_sk_overlimit2()
Also, enables the traces for the following cases:
- When user creates a TIPC socket;
- When user calls poll() on TIPC socket;
- When user sends a dgram/mcast/stream message.
- When a message is put into the socket 'sk_receive_queue';
- When a message is released from the socket 'sk_receive_queue';
- When a message is rejected (e.g. due to no port, invalid, etc.);
- When a message is dropped (e.g. due to wrong message type);
- When socket is released;
- When socket is shutdown;
- When socket rcvq's allocation is overlimit (> 90%);
- When socket rcvq + bklq's allocation is overlimit (> 90%);
- When the 'TIPC_ERR_OVERLOAD/2' issue happens;
Note:
a) All the socket traces are designed to be able to trace on a specific
socket by either using the 'event filtering' feature on a known socket
'portid' value or the sysctl file:
/proc/sys/net/tipc/sk_filter
The file determines a 'tuple' for what socket should be traced:
(portid, sock type, name type, name lower, name upper)
where:
+ 'portid' is the socket portid generated at socket creating, can be
found in the trace outputs or the 'tipc socket list' command printouts;
+ 'sock type' is the socket type (1 = SOCK_TREAM, ...);
+ 'name type', 'name lower' and 'name upper' are the service name being
connected to or published by the socket.
Value '0' means 'ANY', the default tuple value is (0, 0, 0, 0, 0) i.e.
the traces happen for every sockets with no filter.
b) The 'tipc_sk_overlimit1/2' event is also a conditional trace_event
which happens when the socket receive queue (and backlog queue) is
about to be overloaded, when the queue allocation is > 90%. Then, when
the trace is enabled, the last skbs leading to the TIPC_ERR_OVERLOAD/2
issue can be traced.
The trace event is designed as an 'upper watermark' notification that
the other traces (e.g. 'tipc_sk_advance_rx' vs 'tipc_sk_filter_rcv') or
actions can be triggerred in the meanwhile to see what is going on with
the socket queue.
In addition, the 'trace_tipc_sk_dump()' is also placed at the
'TIPC_ERR_OVERLOAD/2' case, so the socket and last skb can be dumped
for post-analysis.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 09:17:58 +07:00
|
|
|
}
|
2014-06-26 08:41:41 +07:00
|
|
|
|
2017-04-26 15:05:02 +07:00
|
|
|
if (unlikely(msg_errcode(hdr))) {
|
|
|
|
tipc_set_sk_state(sk, TIPC_DISCONNECTING);
|
|
|
|
tipc_node_remove_conn(sock_net(sk), tsk_peer_node(tsk),
|
|
|
|
tsk_peer_port(tsk));
|
|
|
|
sk->sk_state_change(sk);
|
2018-10-10 22:50:23 +07:00
|
|
|
|
|
|
|
/* State change is ignored if socket already awake,
|
|
|
|
* - convert msg to abort msg and add to inqueue
|
|
|
|
*/
|
|
|
|
msg_set_user(hdr, TIPC_CRITICAL_IMPORTANCE);
|
|
|
|
msg_set_type(hdr, TIPC_CONN_MSG);
|
|
|
|
msg_set_size(hdr, BASIC_H_SIZE);
|
|
|
|
msg_set_hdr_sz(hdr, BASIC_H_SIZE);
|
|
|
|
__skb_queue_tail(inputq, skb);
|
|
|
|
return;
|
2017-04-26 15:05:02 +07:00
|
|
|
}
|
|
|
|
|
2016-11-01 20:02:44 +07:00
|
|
|
tsk->probe_unacked = false;
|
2014-06-26 08:41:41 +07:00
|
|
|
|
tipc: introduce new tipc_sk_respond() function
Currently, we use the code sequence
if (msg_reverse())
tipc_link_xmit_skb()
at numerous locations in socket.c. The preparation of arguments
for these calls, as well as the sequence itself, makes the code
unecessarily complex.
In this commit, we introduce a new function, tipc_sk_respond(),
that performs this call combination. We also replace some, but not
yet all, of these explicit call sequences with calls to the new
function. Notably, we let the function tipc_sk_proto_rcv() use
the new function to directly send out PROBE_REPLY messages,
instead of deferring this to the calling tipc_sk_rcv() function,
as we do now.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-22 21:11:19 +07:00
|
|
|
if (mtyp == CONN_PROBE) {
|
|
|
|
msg_set_type(hdr, CONN_PROBE_REPLY);
|
2016-06-17 17:35:57 +07:00
|
|
|
if (tipc_msg_reverse(onode, &skb, TIPC_OK))
|
|
|
|
__skb_queue_tail(xmitq, skb);
|
tipc: introduce new tipc_sk_respond() function
Currently, we use the code sequence
if (msg_reverse())
tipc_link_xmit_skb()
at numerous locations in socket.c. The preparation of arguments
for these calls, as well as the sequence itself, makes the code
unecessarily complex.
In this commit, we introduce a new function, tipc_sk_respond(),
that performs this call combination. We also replace some, but not
yet all, of these explicit call sequences with calls to the new
function. Notably, we let the function tipc_sk_proto_rcv() use
the new function to directly send out PROBE_REPLY messages,
instead of deferring this to the calling tipc_sk_rcv() function,
as we do now.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-22 21:11:19 +07:00
|
|
|
return;
|
|
|
|
} else if (mtyp == CONN_ACK) {
|
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 20:00:41 +07:00
|
|
|
was_cong = tsk_conn_cong(tsk);
|
tipc: add test for Nagle algorithm effectiveness
When streaming in Nagle mode, we try to bundle small messages from user
as many as possible if there is one outstanding buffer, i.e. not ACK-ed
by the receiving side, which helps boost up the overall throughput. So,
the algorithm's effectiveness really depends on when Nagle ACK comes or
what the specific network latency (RTT) is, compared to the user's
message sending rate.
In a bad case, the user's sending rate is low or the network latency is
small, there will not be many bundles, so making a Nagle ACK or waiting
for it is not meaningful.
For example: a user sends its messages every 100ms and the RTT is 50ms,
then for each messages, we require one Nagle ACK but then there is only
one user message sent without any bundles.
In a better case, even if we have a few bundles (e.g. the RTT = 300ms),
but now the user sends messages in medium size, then there will not be
any difference at all, that says 3 x 1000-byte data messages if bundled
will still result in 3 bundles with MTU = 1500.
When Nagle is ineffective, the delay in user message sending is clearly
wasted instead of sending directly.
Besides, adding Nagle ACKs will consume some processor load on both the
sending and receiving sides.
This commit adds a test on the effectiveness of the Nagle algorithm for
an individual connection in the network on which it actually runs.
Particularly, upon receipt of a Nagle ACK we will compare the number of
bundles in the backlog queue to the number of user messages which would
be sent directly without Nagle. If the ratio is good (e.g. >= 2), Nagle
mode will be kept for further message sending. Otherwise, we will leave
Nagle and put a 'penalty' on the connection, so it will have to spend
more 'one-way' messages before being able to re-enter Nagle.
In addition, the 'ack-required' bit is only set when really needed that
the number of Nagle ACKs will be reduced during Nagle mode.
Testing with benchmark showed that with the patch, there was not much
difference in throughput for small messages since the tool continuously
sends messages without a break, so Nagle would still take in effect.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 16:38:38 +07:00
|
|
|
tipc_sk_push_backlog(tsk, msg_nagle_ack(hdr));
|
tipc: redesign connection-level flow control
There are two flow control mechanisms in TIPC; one at link level that
handles network congestion, burst control, and retransmission, and one
at connection level which' only remaining task is to prevent overflow
in the receiving socket buffer. In TIPC, the latter task has to be
solved end-to-end because messages can not be thrown away once they
have been accepted and delivered upwards from the link layer, i.e, we
can never permit the receive buffer to overflow.
Currently, this algorithm is message based. A counter in the receiving
socket keeps track of number of consumed messages, and sends a dedicated
acknowledge message back to the sender for each 256 consumed message.
A counter at the sending end keeps track of the sent, not yet
acknowledged messages, and blocks the sender if this number ever reaches
512 unacknowledged messages. When the missing acknowledge arrives, the
socket is then woken up for renewed transmission. This works well for
keeping the message flow running, as it almost never happens that a
sender socket is blocked this way.
A problem with the current mechanism is that it potentially is very
memory consuming. Since we don't distinguish between small and large
messages, we have to dimension the socket receive buffer according
to a worst-case of both. I.e., the window size must be chosen large
enough to sustain a reasonable throughput even for the smallest
messages, while we must still consider a scenario where all messages
are of maximum size. Hence, the current fix window size of 512 messages
and a maximum message size of 66k results in a receive buffer of 66 MB
when truesize(66k) = 131k is taken into account. It is possible to do
much better.
This commit introduces an algorithm where we instead use 1024-byte
blocks as base unit. This unit, always rounded upwards from the
actual message size, is used when we advertise windows as well as when
we count and acknowledge transmitted data. The advertised window is
based on the configured receive buffer size in such a way that even
the worst-case truesize/msgsize ratio always is covered. Since the
smallest possible message size (from a flow control viewpoint) now is
1024 bytes, we can safely assume this ratio to be less than four, which
is the value we are now using.
This way, we have been able to reduce the default receive buffer size
from 66 MB to 2 MB with maintained performance.
In order to keep this solution backwards compatible, we introduce a
new capability bit in the discovery protocol, and use this throughout
the message sending/reception path to always select the right unit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 22:58:47 +07:00
|
|
|
tsk->snt_unacked -= msg_conn_ack(hdr);
|
|
|
|
if (tsk->peer_caps & TIPC_BLOCK_FLOWCTL)
|
|
|
|
tsk->snd_win = msg_adv_win(hdr);
|
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 20:00:41 +07:00
|
|
|
if (was_cong && !tsk_conn_cong(tsk))
|
tipc: introduce new tipc_sk_respond() function
Currently, we use the code sequence
if (msg_reverse())
tipc_link_xmit_skb()
at numerous locations in socket.c. The preparation of arguments
for these calls, as well as the sequence itself, makes the code
unecessarily complex.
In this commit, we introduce a new function, tipc_sk_respond(),
that performs this call combination. We also replace some, but not
yet all, of these explicit call sequences with calls to the new
function. Notably, we let the function tipc_sk_proto_rcv() use
the new function to directly send out PROBE_REPLY messages,
instead of deferring this to the calling tipc_sk_rcv() function,
as we do now.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-22 21:11:19 +07:00
|
|
|
sk->sk_write_space(sk);
|
|
|
|
} else if (mtyp != CONN_PROBE_REPLY) {
|
|
|
|
pr_warn("Received unknown CONN_PROTO msg\n");
|
2014-06-26 08:41:41 +07:00
|
|
|
}
|
|
|
|
exit:
|
tipc: introduce new tipc_sk_respond() function
Currently, we use the code sequence
if (msg_reverse())
tipc_link_xmit_skb()
at numerous locations in socket.c. The preparation of arguments
for these calls, as well as the sequence itself, makes the code
unecessarily complex.
In this commit, we introduce a new function, tipc_sk_respond(),
that performs this call combination. We also replace some, but not
yet all, of these explicit call sequences with calls to the new
function. Notably, we let the function tipc_sk_proto_rcv() use
the new function to directly send out PROBE_REPLY messages,
instead of deferring this to the calling tipc_sk_rcv() function,
as we do now.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-22 21:11:19 +07:00
|
|
|
kfree_skb(skb);
|
2014-06-26 08:41:41 +07:00
|
|
|
}
|
|
|
|
|
2006-01-03 01:04:38 +07:00
|
|
|
/**
|
2014-02-18 15:06:46 +07:00
|
|
|
* tipc_sendmsg - send message in connectionless manner
|
2006-01-03 01:04:38 +07:00
|
|
|
* @sock: socket structure
|
|
|
|
* @m: message to send
|
2014-06-26 08:41:37 +07:00
|
|
|
* @dsz: amount of user data to be sent
|
2007-02-09 21:25:21 +07:00
|
|
|
*
|
2006-01-03 01:04:38 +07:00
|
|
|
* Message must have an destination specified explicitly.
|
2007-02-09 21:25:21 +07:00
|
|
|
* Used for SOCK_RDM and SOCK_DGRAM messages,
|
2006-01-03 01:04:38 +07:00
|
|
|
* and for 'SYN' messages on SOCK_SEQPACKET and SOCK_STREAM connections.
|
|
|
|
* (Note: 'SYN+' is prohibited on SOCK_STREAM.)
|
2007-02-09 21:25:21 +07:00
|
|
|
*
|
2006-01-03 01:04:38 +07:00
|
|
|
* Returns the number of bytes sent on success, or errno otherwise
|
|
|
|
*/
|
2015-03-02 14:37:48 +07:00
|
|
|
static int tipc_sendmsg(struct socket *sock,
|
2014-06-26 08:41:37 +07:00
|
|
|
struct msghdr *m, size_t dsz)
|
2015-03-02 14:37:47 +07:00
|
|
|
{
|
|
|
|
struct sock *sk = sock->sk;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
lock_sock(sk);
|
|
|
|
ret = __tipc_sendmsg(sock, m, dsz);
|
|
|
|
release_sock(sk);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
static int __tipc_sendmsg(struct socket *sock, struct msghdr *m, size_t dlen)
|
2006-01-03 01:04:38 +07:00
|
|
|
{
|
2008-04-15 14:22:02 +07:00
|
|
|
struct sock *sk = sock->sk;
|
2015-01-09 14:27:05 +07:00
|
|
|
struct net *net = sock_net(sk);
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
struct tipc_sock *tsk = tipc_sk(sk);
|
|
|
|
DECLARE_SOCKADDR(struct sockaddr_tipc *, dest, m->msg_name);
|
|
|
|
long timeout = sock_sndtimeo(sk, m->msg_flags & MSG_DONTWAIT);
|
|
|
|
struct list_head *clinks = &tsk->cong_links;
|
|
|
|
bool syn = !tipc_sk_type_connectionless(sk);
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
struct tipc_group *grp = tsk->group;
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
struct tipc_msg *hdr = &tsk->phdr;
|
2015-03-19 15:02:19 +07:00
|
|
|
struct tipc_name_seq *seq;
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
struct sk_buff_head pkts;
|
tipc: fix retrans failure due to wrong destination
When a user message is sent, TIPC will check if the socket has faced a
congestion at link layer. If that happens, it will make a sleep to wait
for the congestion to disappear. This leaves a gap for other users to
take over the socket (e.g. multi threads) since the socket is released
as well. Also, in case of connectionless (e.g. SOCK_RDM), user is free
to send messages to various destinations (e.g. via 'sendto()'), then
the socket's preformatted header has to be updated correspondingly
prior to the actual payload message building.
Unfortunately, the latter action is done before the first action which
causes a condition issue that the destination of a certain message can
be modified incorrectly in the middle, leading to wrong destination
when that message is built. Consequently, when the message is sent to
the link layer, it gets stuck there forever because the peer node will
simply reject it. After a number of retransmission attempts, the link
is eventually taken down and the retransmission failure is reported.
This commit fixes the problem by rearranging the order of actions to
prevent the race condition from occurring, so the message building is
'atomic' and its header will not be modified by anyone.
Fixes: 365ad353c256 ("tipc: reduce risk of user starvation during link congestion")
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-10 15:21:04 +07:00
|
|
|
u32 dport = 0, dnode = 0;
|
|
|
|
u32 type = 0, inst = 0;
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
int mtu, rc;
|
2006-01-03 01:04:38 +07:00
|
|
|
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
if (unlikely(dlen > TIPC_MAX_USER_MSG_SIZE))
|
2010-04-21 04:58:24 +07:00
|
|
|
return -EMSGSIZE;
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
|
2017-10-13 16:04:27 +07:00
|
|
|
if (likely(dest)) {
|
|
|
|
if (unlikely(m->msg_namelen < sizeof(*dest)))
|
|
|
|
return -EINVAL;
|
|
|
|
if (unlikely(dest->family != AF_TIPC))
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (grp) {
|
|
|
|
if (!dest)
|
|
|
|
return tipc_send_group_bcast(sock, m, dlen, timeout);
|
2017-10-13 16:04:28 +07:00
|
|
|
if (dest->addrtype == TIPC_ADDR_NAME)
|
|
|
|
return tipc_send_group_anycast(sock, m, dlen, timeout);
|
2017-10-13 16:04:27 +07:00
|
|
|
if (dest->addrtype == TIPC_ADDR_ID)
|
|
|
|
return tipc_send_group_unicast(sock, m, dlen, timeout);
|
2017-10-13 16:04:29 +07:00
|
|
|
if (dest->addrtype == TIPC_ADDR_MCAST)
|
|
|
|
return tipc_send_group_mcast(sock, m, dlen, timeout);
|
2017-10-13 16:04:27 +07:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
|
2015-03-19 15:02:19 +07:00
|
|
|
if (unlikely(!dest)) {
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
dest = &tsk->peer;
|
2019-03-05 05:26:10 +07:00
|
|
|
if (!syn && dest->family != AF_TIPC)
|
2015-03-19 15:02:19 +07:00
|
|
|
return -EDESTADDRREQ;
|
|
|
|
}
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
|
|
|
|
if (unlikely(syn)) {
|
2016-11-01 20:02:43 +07:00
|
|
|
if (sk->sk_state == TIPC_LISTEN)
|
2015-03-02 14:37:47 +07:00
|
|
|
return -EPIPE;
|
2016-11-01 20:02:45 +07:00
|
|
|
if (sk->sk_state != TIPC_OPEN)
|
2015-03-02 14:37:47 +07:00
|
|
|
return -EISCONN;
|
|
|
|
if (tsk->published)
|
|
|
|
return -EOPNOTSUPP;
|
2006-06-26 13:44:57 +07:00
|
|
|
if (dest->addrtype == TIPC_ADDR_NAME) {
|
2014-08-23 05:09:20 +07:00
|
|
|
tsk->conn_type = dest->addr.name.name.type;
|
|
|
|
tsk->conn_instance = dest->addr.name.name.instance;
|
2006-06-26 13:44:57 +07:00
|
|
|
}
|
2018-09-29 01:23:21 +07:00
|
|
|
msg_set_syn(hdr, 1);
|
2006-01-03 01:04:38 +07:00
|
|
|
}
|
2014-06-26 08:41:37 +07:00
|
|
|
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
seq = &dest->addr.nameseq;
|
|
|
|
if (dest->addrtype == TIPC_ADDR_MCAST)
|
|
|
|
return tipc_sendmcast(sock, seq, m, dlen, timeout);
|
2014-06-26 08:41:37 +07:00
|
|
|
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
if (dest->addrtype == TIPC_ADDR_NAME) {
|
|
|
|
type = dest->addr.name.name.type;
|
|
|
|
inst = dest->addr.name.name.instance;
|
2018-03-15 22:48:51 +07:00
|
|
|
dnode = dest->addr.name.domain;
|
2015-01-09 14:27:09 +07:00
|
|
|
dport = tipc_nametbl_translate(net, type, inst, &dnode);
|
2015-03-02 14:37:47 +07:00
|
|
|
if (unlikely(!dport && !dnode))
|
|
|
|
return -EHOSTUNREACH;
|
2014-06-26 08:41:37 +07:00
|
|
|
} else if (dest->addrtype == TIPC_ADDR_ID) {
|
|
|
|
dnode = dest->addr.id.node;
|
2018-04-12 06:15:48 +07:00
|
|
|
} else {
|
|
|
|
return -EINVAL;
|
2014-06-26 08:41:37 +07:00
|
|
|
}
|
|
|
|
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
/* Block or return if destination link is congested */
|
2017-10-13 16:04:22 +07:00
|
|
|
rc = tipc_wait_for_cond(sock, &timeout,
|
|
|
|
!tipc_dest_find(clinks, dnode, 0));
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
if (unlikely(rc))
|
|
|
|
return rc;
|
|
|
|
|
tipc: fix retrans failure due to wrong destination
When a user message is sent, TIPC will check if the socket has faced a
congestion at link layer. If that happens, it will make a sleep to wait
for the congestion to disappear. This leaves a gap for other users to
take over the socket (e.g. multi threads) since the socket is released
as well. Also, in case of connectionless (e.g. SOCK_RDM), user is free
to send messages to various destinations (e.g. via 'sendto()'), then
the socket's preformatted header has to be updated correspondingly
prior to the actual payload message building.
Unfortunately, the latter action is done before the first action which
causes a condition issue that the destination of a certain message can
be modified incorrectly in the middle, leading to wrong destination
when that message is built. Consequently, when the message is sent to
the link layer, it gets stuck there forever because the peer node will
simply reject it. After a number of retransmission attempts, the link
is eventually taken down and the retransmission failure is reported.
This commit fixes the problem by rearranging the order of actions to
prevent the race condition from occurring, so the message building is
'atomic' and its header will not be modified by anyone.
Fixes: 365ad353c256 ("tipc: reduce risk of user starvation during link congestion")
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-10 15:21:04 +07:00
|
|
|
if (dest->addrtype == TIPC_ADDR_NAME) {
|
|
|
|
msg_set_type(hdr, TIPC_NAMED_MSG);
|
|
|
|
msg_set_hdr_sz(hdr, NAMED_H_SIZE);
|
|
|
|
msg_set_nametype(hdr, type);
|
|
|
|
msg_set_nameinst(hdr, inst);
|
|
|
|
msg_set_lookup_scope(hdr, tipc_node2scope(dnode));
|
|
|
|
msg_set_destnode(hdr, dnode);
|
|
|
|
msg_set_destport(hdr, dport);
|
|
|
|
} else { /* TIPC_ADDR_ID */
|
|
|
|
msg_set_type(hdr, TIPC_DIRECT_MSG);
|
|
|
|
msg_set_lookup_scope(hdr, 0);
|
|
|
|
msg_set_destnode(hdr, dnode);
|
|
|
|
msg_set_destport(hdr, dest->addr.id.ref);
|
|
|
|
msg_set_hdr_sz(hdr, BASIC_H_SIZE);
|
|
|
|
}
|
|
|
|
|
2019-08-15 21:42:50 +07:00
|
|
|
__skb_queue_head_init(&pkts);
|
2020-03-26 09:50:29 +07:00
|
|
|
mtu = tipc_node_get_mtu(net, dnode, tsk->portid, true);
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts);
|
|
|
|
if (unlikely(rc != dlen))
|
2015-03-02 14:37:47 +07:00
|
|
|
return rc;
|
2019-11-28 10:10:05 +07:00
|
|
|
if (unlikely(syn && !tipc_msg_skb_clone(&pkts, &sk->sk_write_queue))) {
|
|
|
|
__skb_queue_purge(&pkts);
|
2018-09-29 01:23:22 +07:00
|
|
|
return -ENOMEM;
|
2019-11-28 10:10:05 +07:00
|
|
|
}
|
2014-06-26 08:41:37 +07:00
|
|
|
|
tipc: add trace_events for tipc socket
The commit adds the new trace_events for TIPC socket object:
trace_tipc_sk_create()
trace_tipc_sk_poll()
trace_tipc_sk_sendmsg()
trace_tipc_sk_sendmcast()
trace_tipc_sk_sendstream()
trace_tipc_sk_filter_rcv()
trace_tipc_sk_advance_rx()
trace_tipc_sk_rej_msg()
trace_tipc_sk_drop_msg()
trace_tipc_sk_release()
trace_tipc_sk_shutdown()
trace_tipc_sk_overlimit1()
trace_tipc_sk_overlimit2()
Also, enables the traces for the following cases:
- When user creates a TIPC socket;
- When user calls poll() on TIPC socket;
- When user sends a dgram/mcast/stream message.
- When a message is put into the socket 'sk_receive_queue';
- When a message is released from the socket 'sk_receive_queue';
- When a message is rejected (e.g. due to no port, invalid, etc.);
- When a message is dropped (e.g. due to wrong message type);
- When socket is released;
- When socket is shutdown;
- When socket rcvq's allocation is overlimit (> 90%);
- When socket rcvq + bklq's allocation is overlimit (> 90%);
- When the 'TIPC_ERR_OVERLOAD/2' issue happens;
Note:
a) All the socket traces are designed to be able to trace on a specific
socket by either using the 'event filtering' feature on a known socket
'portid' value or the sysctl file:
/proc/sys/net/tipc/sk_filter
The file determines a 'tuple' for what socket should be traced:
(portid, sock type, name type, name lower, name upper)
where:
+ 'portid' is the socket portid generated at socket creating, can be
found in the trace outputs or the 'tipc socket list' command printouts;
+ 'sock type' is the socket type (1 = SOCK_TREAM, ...);
+ 'name type', 'name lower' and 'name upper' are the service name being
connected to or published by the socket.
Value '0' means 'ANY', the default tuple value is (0, 0, 0, 0, 0) i.e.
the traces happen for every sockets with no filter.
b) The 'tipc_sk_overlimit1/2' event is also a conditional trace_event
which happens when the socket receive queue (and backlog queue) is
about to be overloaded, when the queue allocation is > 90%. Then, when
the trace is enabled, the last skbs leading to the TIPC_ERR_OVERLOAD/2
issue can be traced.
The trace event is designed as an 'upper watermark' notification that
the other traces (e.g. 'tipc_sk_advance_rx' vs 'tipc_sk_filter_rcv') or
actions can be triggerred in the meanwhile to see what is going on with
the socket queue.
In addition, the 'trace_tipc_sk_dump()' is also placed at the
'TIPC_ERR_OVERLOAD/2' case, so the socket and last skb can be dumped
for post-analysis.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 09:17:58 +07:00
|
|
|
trace_tipc_sk_sendmsg(sk, skb_peek(&pkts), TIPC_DUMP_SK_SNDQ, " ");
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
rc = tipc_node_xmit(net, &pkts, dnode, tsk->portid);
|
|
|
|
if (unlikely(rc == -ELINKCONG)) {
|
2017-10-13 16:04:22 +07:00
|
|
|
tipc_dest_push(clinks, dnode, 0);
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
tsk->cong_link_cnt++;
|
|
|
|
rc = 0;
|
|
|
|
}
|
2014-06-26 08:41:37 +07:00
|
|
|
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
if (unlikely(syn && !rc))
|
|
|
|
tipc_set_sk_state(sk, TIPC_CONNECTING);
|
|
|
|
|
|
|
|
return rc ? rc : dlen;
|
2006-01-03 01:04:38 +07:00
|
|
|
}
|
|
|
|
|
2007-02-09 21:25:21 +07:00
|
|
|
/**
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
* tipc_sendstream - send stream-oriented data
|
2006-01-03 01:04:38 +07:00
|
|
|
* @sock: socket structure
|
2014-06-26 08:41:38 +07:00
|
|
|
* @m: data to send
|
|
|
|
* @dsz: total length of data to be transmitted
|
2007-02-09 21:25:21 +07:00
|
|
|
*
|
2014-06-26 08:41:38 +07:00
|
|
|
* Used for SOCK_STREAM data.
|
2007-02-09 21:25:21 +07:00
|
|
|
*
|
2014-06-26 08:41:38 +07:00
|
|
|
* Returns the number of bytes sent on success (or partial success),
|
|
|
|
* or errno if no data sent
|
2006-01-03 01:04:38 +07:00
|
|
|
*/
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
static int tipc_sendstream(struct socket *sock, struct msghdr *m, size_t dsz)
|
2015-03-02 14:37:47 +07:00
|
|
|
{
|
|
|
|
struct sock *sk = sock->sk;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
lock_sock(sk);
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
ret = __tipc_sendstream(sock, m, dsz);
|
2015-03-02 14:37:47 +07:00
|
|
|
release_sock(sk);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
static int __tipc_sendstream(struct socket *sock, struct msghdr *m, size_t dlen)
|
2006-01-03 01:04:38 +07:00
|
|
|
{
|
2008-04-15 14:22:02 +07:00
|
|
|
struct sock *sk = sock->sk;
|
2014-01-18 04:53:15 +07:00
|
|
|
DECLARE_SOCKADDR(struct sockaddr_tipc *, dest, m->msg_name);
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
long timeout = sock_sndtimeo(sk, m->msg_flags & MSG_DONTWAIT);
|
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 20:00:41 +07:00
|
|
|
struct sk_buff_head *txq = &sk->sk_write_queue;
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
struct tipc_sock *tsk = tipc_sk(sk);
|
|
|
|
struct tipc_msg *hdr = &tsk->phdr;
|
|
|
|
struct net *net = sock_net(sk);
|
tipc: add test for Nagle algorithm effectiveness
When streaming in Nagle mode, we try to bundle small messages from user
as many as possible if there is one outstanding buffer, i.e. not ACK-ed
by the receiving side, which helps boost up the overall throughput. So,
the algorithm's effectiveness really depends on when Nagle ACK comes or
what the specific network latency (RTT) is, compared to the user's
message sending rate.
In a bad case, the user's sending rate is low or the network latency is
small, there will not be many bundles, so making a Nagle ACK or waiting
for it is not meaningful.
For example: a user sends its messages every 100ms and the RTT is 50ms,
then for each messages, we require one Nagle ACK but then there is only
one user message sent without any bundles.
In a better case, even if we have a few bundles (e.g. the RTT = 300ms),
but now the user sends messages in medium size, then there will not be
any difference at all, that says 3 x 1000-byte data messages if bundled
will still result in 3 bundles with MTU = 1500.
When Nagle is ineffective, the delay in user message sending is clearly
wasted instead of sending directly.
Besides, adding Nagle ACKs will consume some processor load on both the
sending and receiving sides.
This commit adds a test on the effectiveness of the Nagle algorithm for
an individual connection in the network on which it actually runs.
Particularly, upon receipt of a Nagle ACK we will compare the number of
bundles in the backlog queue to the number of user messages which would
be sent directly without Nagle. If the ratio is good (e.g. >= 2), Nagle
mode will be kept for further message sending. Otherwise, we will leave
Nagle and put a 'penalty' on the connection, so it will have to spend
more 'one-way' messages before being able to re-enter Nagle.
In addition, the 'ack-required' bit is only set when really needed that
the number of Nagle ACKs will be reduced during Nagle mode.
Testing with benchmark showed that with the patch, there was not much
difference in throughput for small messages since the tool continuously
sends messages without a break, so Nagle would still take in effect.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 16:38:38 +07:00
|
|
|
struct sk_buff *skb;
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
u32 dnode = tsk_peer_node(tsk);
|
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 20:00:41 +07:00
|
|
|
int maxnagle = tsk->maxnagle;
|
|
|
|
int maxpkt = tsk->max_pkt;
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
int send, sent = 0;
|
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 20:00:41 +07:00
|
|
|
int blocks, rc = 0;
|
2016-11-01 20:02:34 +07:00
|
|
|
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
if (unlikely(dlen > INT_MAX))
|
|
|
|
return -EMSGSIZE;
|
2014-06-26 08:41:38 +07:00
|
|
|
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
/* Handle implicit connection setup */
|
|
|
|
if (unlikely(dest)) {
|
|
|
|
rc = __tipc_sendmsg(sock, m, dlen);
|
2018-09-25 23:21:58 +07:00
|
|
|
if (dlen && dlen == rc) {
|
|
|
|
tsk->peer_caps = tipc_node_get_capabilities(net, dnode);
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
tsk->snt_unacked = tsk_inc(tsk, dlen + msg_hdr_sz(hdr));
|
2018-09-25 23:21:58 +07:00
|
|
|
}
|
2015-03-02 14:37:47 +07:00
|
|
|
return rc;
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
}
|
2016-03-01 17:07:09 +07:00
|
|
|
|
2007-02-09 21:25:21 +07:00
|
|
|
do {
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
rc = tipc_wait_for_cond(sock, &timeout,
|
|
|
|
(!tsk->cong_link_cnt &&
|
2017-01-03 22:55:09 +07:00
|
|
|
!tsk_conn_cong(tsk) &&
|
|
|
|
tipc_sk_connected(sk)));
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
if (unlikely(rc))
|
|
|
|
break;
|
|
|
|
send = min_t(size_t, dlen - sent, TIPC_MAX_USER_MSG_SIZE);
|
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 20:00:41 +07:00
|
|
|
blocks = tsk->snd_backlog;
|
2020-06-11 17:07:35 +07:00
|
|
|
if (tsk->oneway++ >= tsk->nagle_start && maxnagle &&
|
|
|
|
send <= maxnagle) {
|
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 20:00:41 +07:00
|
|
|
rc = tipc_msg_append(hdr, m, send, maxnagle, txq);
|
|
|
|
if (unlikely(rc < 0))
|
|
|
|
break;
|
|
|
|
blocks += rc;
|
tipc: add test for Nagle algorithm effectiveness
When streaming in Nagle mode, we try to bundle small messages from user
as many as possible if there is one outstanding buffer, i.e. not ACK-ed
by the receiving side, which helps boost up the overall throughput. So,
the algorithm's effectiveness really depends on when Nagle ACK comes or
what the specific network latency (RTT) is, compared to the user's
message sending rate.
In a bad case, the user's sending rate is low or the network latency is
small, there will not be many bundles, so making a Nagle ACK or waiting
for it is not meaningful.
For example: a user sends its messages every 100ms and the RTT is 50ms,
then for each messages, we require one Nagle ACK but then there is only
one user message sent without any bundles.
In a better case, even if we have a few bundles (e.g. the RTT = 300ms),
but now the user sends messages in medium size, then there will not be
any difference at all, that says 3 x 1000-byte data messages if bundled
will still result in 3 bundles with MTU = 1500.
When Nagle is ineffective, the delay in user message sending is clearly
wasted instead of sending directly.
Besides, adding Nagle ACKs will consume some processor load on both the
sending and receiving sides.
This commit adds a test on the effectiveness of the Nagle algorithm for
an individual connection in the network on which it actually runs.
Particularly, upon receipt of a Nagle ACK we will compare the number of
bundles in the backlog queue to the number of user messages which would
be sent directly without Nagle. If the ratio is good (e.g. >= 2), Nagle
mode will be kept for further message sending. Otherwise, we will leave
Nagle and put a 'penalty' on the connection, so it will have to spend
more 'one-way' messages before being able to re-enter Nagle.
In addition, the 'ack-required' bit is only set when really needed that
the number of Nagle ACKs will be reduced during Nagle mode.
Testing with benchmark showed that with the patch, there was not much
difference in throughput for small messages since the tool continuously
sends messages without a break, so Nagle would still take in effect.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 16:38:38 +07:00
|
|
|
tsk->msg_acc++;
|
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 20:00:41 +07:00
|
|
|
if (blocks <= 64 && tsk->expect_ack) {
|
|
|
|
tsk->snd_backlog = blocks;
|
|
|
|
sent += send;
|
|
|
|
break;
|
tipc: add test for Nagle algorithm effectiveness
When streaming in Nagle mode, we try to bundle small messages from user
as many as possible if there is one outstanding buffer, i.e. not ACK-ed
by the receiving side, which helps boost up the overall throughput. So,
the algorithm's effectiveness really depends on when Nagle ACK comes or
what the specific network latency (RTT) is, compared to the user's
message sending rate.
In a bad case, the user's sending rate is low or the network latency is
small, there will not be many bundles, so making a Nagle ACK or waiting
for it is not meaningful.
For example: a user sends its messages every 100ms and the RTT is 50ms,
then for each messages, we require one Nagle ACK but then there is only
one user message sent without any bundles.
In a better case, even if we have a few bundles (e.g. the RTT = 300ms),
but now the user sends messages in medium size, then there will not be
any difference at all, that says 3 x 1000-byte data messages if bundled
will still result in 3 bundles with MTU = 1500.
When Nagle is ineffective, the delay in user message sending is clearly
wasted instead of sending directly.
Besides, adding Nagle ACKs will consume some processor load on both the
sending and receiving sides.
This commit adds a test on the effectiveness of the Nagle algorithm for
an individual connection in the network on which it actually runs.
Particularly, upon receipt of a Nagle ACK we will compare the number of
bundles in the backlog queue to the number of user messages which would
be sent directly without Nagle. If the ratio is good (e.g. >= 2), Nagle
mode will be kept for further message sending. Otherwise, we will leave
Nagle and put a 'penalty' on the connection, so it will have to spend
more 'one-way' messages before being able to re-enter Nagle.
In addition, the 'ack-required' bit is only set when really needed that
the number of Nagle ACKs will be reduced during Nagle mode.
Testing with benchmark showed that with the patch, there was not much
difference in throughput for small messages since the tool continuously
sends messages without a break, so Nagle would still take in effect.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 16:38:38 +07:00
|
|
|
} else if (blocks > 64) {
|
|
|
|
tsk->pkt_cnt += skb_queue_len(txq);
|
|
|
|
} else {
|
|
|
|
skb = skb_peek_tail(txq);
|
2020-05-28 21:34:07 +07:00
|
|
|
if (skb) {
|
|
|
|
msg_set_ack_required(buf_msg(skb));
|
|
|
|
tsk->expect_ack = true;
|
|
|
|
} else {
|
|
|
|
tsk->expect_ack = false;
|
|
|
|
}
|
tipc: add test for Nagle algorithm effectiveness
When streaming in Nagle mode, we try to bundle small messages from user
as many as possible if there is one outstanding buffer, i.e. not ACK-ed
by the receiving side, which helps boost up the overall throughput. So,
the algorithm's effectiveness really depends on when Nagle ACK comes or
what the specific network latency (RTT) is, compared to the user's
message sending rate.
In a bad case, the user's sending rate is low or the network latency is
small, there will not be many bundles, so making a Nagle ACK or waiting
for it is not meaningful.
For example: a user sends its messages every 100ms and the RTT is 50ms,
then for each messages, we require one Nagle ACK but then there is only
one user message sent without any bundles.
In a better case, even if we have a few bundles (e.g. the RTT = 300ms),
but now the user sends messages in medium size, then there will not be
any difference at all, that says 3 x 1000-byte data messages if bundled
will still result in 3 bundles with MTU = 1500.
When Nagle is ineffective, the delay in user message sending is clearly
wasted instead of sending directly.
Besides, adding Nagle ACKs will consume some processor load on both the
sending and receiving sides.
This commit adds a test on the effectiveness of the Nagle algorithm for
an individual connection in the network on which it actually runs.
Particularly, upon receipt of a Nagle ACK we will compare the number of
bundles in the backlog queue to the number of user messages which would
be sent directly without Nagle. If the ratio is good (e.g. >= 2), Nagle
mode will be kept for further message sending. Otherwise, we will leave
Nagle and put a 'penalty' on the connection, so it will have to spend
more 'one-way' messages before being able to re-enter Nagle.
In addition, the 'ack-required' bit is only set when really needed that
the number of Nagle ACKs will be reduced during Nagle mode.
Testing with benchmark showed that with the patch, there was not much
difference in throughput for small messages since the tool continuously
sends messages without a break, so Nagle would still take in effect.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 16:38:38 +07:00
|
|
|
tsk->msg_acc = 0;
|
|
|
|
tsk->pkt_cnt = 0;
|
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 20:00:41 +07:00
|
|
|
}
|
|
|
|
} else {
|
|
|
|
rc = tipc_msg_build(hdr, m, sent, send, maxpkt, txq);
|
|
|
|
if (unlikely(rc != send))
|
|
|
|
break;
|
|
|
|
blocks += tsk_inc(tsk, send + MIN_H_SIZE);
|
|
|
|
}
|
|
|
|
trace_tipc_sk_sendstream(sk, skb_peek(txq),
|
tipc: add trace_events for tipc socket
The commit adds the new trace_events for TIPC socket object:
trace_tipc_sk_create()
trace_tipc_sk_poll()
trace_tipc_sk_sendmsg()
trace_tipc_sk_sendmcast()
trace_tipc_sk_sendstream()
trace_tipc_sk_filter_rcv()
trace_tipc_sk_advance_rx()
trace_tipc_sk_rej_msg()
trace_tipc_sk_drop_msg()
trace_tipc_sk_release()
trace_tipc_sk_shutdown()
trace_tipc_sk_overlimit1()
trace_tipc_sk_overlimit2()
Also, enables the traces for the following cases:
- When user creates a TIPC socket;
- When user calls poll() on TIPC socket;
- When user sends a dgram/mcast/stream message.
- When a message is put into the socket 'sk_receive_queue';
- When a message is released from the socket 'sk_receive_queue';
- When a message is rejected (e.g. due to no port, invalid, etc.);
- When a message is dropped (e.g. due to wrong message type);
- When socket is released;
- When socket is shutdown;
- When socket rcvq's allocation is overlimit (> 90%);
- When socket rcvq + bklq's allocation is overlimit (> 90%);
- When the 'TIPC_ERR_OVERLOAD/2' issue happens;
Note:
a) All the socket traces are designed to be able to trace on a specific
socket by either using the 'event filtering' feature on a known socket
'portid' value or the sysctl file:
/proc/sys/net/tipc/sk_filter
The file determines a 'tuple' for what socket should be traced:
(portid, sock type, name type, name lower, name upper)
where:
+ 'portid' is the socket portid generated at socket creating, can be
found in the trace outputs or the 'tipc socket list' command printouts;
+ 'sock type' is the socket type (1 = SOCK_TREAM, ...);
+ 'name type', 'name lower' and 'name upper' are the service name being
connected to or published by the socket.
Value '0' means 'ANY', the default tuple value is (0, 0, 0, 0, 0) i.e.
the traces happen for every sockets with no filter.
b) The 'tipc_sk_overlimit1/2' event is also a conditional trace_event
which happens when the socket receive queue (and backlog queue) is
about to be overloaded, when the queue allocation is > 90%. Then, when
the trace is enabled, the last skbs leading to the TIPC_ERR_OVERLOAD/2
issue can be traced.
The trace event is designed as an 'upper watermark' notification that
the other traces (e.g. 'tipc_sk_advance_rx' vs 'tipc_sk_filter_rcv') or
actions can be triggerred in the meanwhile to see what is going on with
the socket queue.
In addition, the 'trace_tipc_sk_dump()' is also placed at the
'TIPC_ERR_OVERLOAD/2' case, so the socket and last skb can be dumped
for post-analysis.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 09:17:58 +07:00
|
|
|
TIPC_DUMP_SK_SNDQ, " ");
|
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 20:00:41 +07:00
|
|
|
rc = tipc_node_xmit(net, txq, dnode, tsk->portid);
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
if (unlikely(rc == -ELINKCONG)) {
|
|
|
|
tsk->cong_link_cnt = 1;
|
|
|
|
rc = 0;
|
|
|
|
}
|
|
|
|
if (likely(!rc)) {
|
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 20:00:41 +07:00
|
|
|
tsk->snt_unacked += blocks;
|
|
|
|
tsk->snd_backlog = 0;
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
sent += send;
|
|
|
|
}
|
|
|
|
} while (sent < dlen && !rc);
|
2015-03-02 14:37:47 +07:00
|
|
|
|
2017-04-24 20:00:42 +07:00
|
|
|
return sent ? sent : rc;
|
2006-01-03 01:04:38 +07:00
|
|
|
}
|
|
|
|
|
2007-02-09 21:25:21 +07:00
|
|
|
/**
|
2014-06-26 08:41:38 +07:00
|
|
|
* tipc_send_packet - send a connection-oriented message
|
2006-01-03 01:04:38 +07:00
|
|
|
* @sock: socket structure
|
2014-06-26 08:41:38 +07:00
|
|
|
* @m: message to send
|
|
|
|
* @dsz: length of data to be transmitted
|
2007-02-09 21:25:21 +07:00
|
|
|
*
|
2014-06-26 08:41:38 +07:00
|
|
|
* Used for SOCK_SEQPACKET messages.
|
2007-02-09 21:25:21 +07:00
|
|
|
*
|
2014-06-26 08:41:38 +07:00
|
|
|
* Returns the number of bytes sent on success, or errno otherwise
|
2006-01-03 01:04:38 +07:00
|
|
|
*/
|
2015-03-02 14:37:48 +07:00
|
|
|
static int tipc_send_packet(struct socket *sock, struct msghdr *m, size_t dsz)
|
2006-01-03 01:04:38 +07:00
|
|
|
{
|
2014-06-26 08:41:38 +07:00
|
|
|
if (dsz > TIPC_MAX_USER_MSG_SIZE)
|
|
|
|
return -EMSGSIZE;
|
2006-01-03 01:04:38 +07:00
|
|
|
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
return tipc_sendstream(sock, m, dsz);
|
2006-01-03 01:04:38 +07:00
|
|
|
}
|
|
|
|
|
2014-08-23 05:09:11 +07:00
|
|
|
/* tipc_sk_finish_conn - complete the setup of a connection
|
2006-01-03 01:04:38 +07:00
|
|
|
*/
|
2014-08-23 05:09:20 +07:00
|
|
|
static void tipc_sk_finish_conn(struct tipc_sock *tsk, u32 peer_port,
|
2014-08-23 05:09:11 +07:00
|
|
|
u32 peer_node)
|
2006-01-03 01:04:38 +07:00
|
|
|
{
|
2015-01-13 16:07:48 +07:00
|
|
|
struct sock *sk = &tsk->sk;
|
|
|
|
struct net *net = sock_net(sk);
|
2014-08-23 05:09:20 +07:00
|
|
|
struct tipc_msg *msg = &tsk->phdr;
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2018-09-29 01:23:21 +07:00
|
|
|
msg_set_syn(msg, 0);
|
2014-08-23 05:09:11 +07:00
|
|
|
msg_set_destnode(msg, peer_node);
|
|
|
|
msg_set_destport(msg, peer_port);
|
|
|
|
msg_set_type(msg, TIPC_CONN_MSG);
|
|
|
|
msg_set_lookup_scope(msg, 0);
|
|
|
|
msg_set_hdr_sz(msg, SHORT_H_SIZE);
|
tipc: introduce non-blocking socket connect
TIPC has so far only supported blocking connect(), meaning that a call
to connect() doesn't return until either the connection is fully
established, or an error occurs. This has proved insufficient for many
users, so we now introduce non-blocking connect(), analogous to how
this is done in TCP and other protocols.
With this feature, if a connection cannot be established instantly,
connect() will return the error code "-EINPROGRESS".
If the user later calls connect() again, he will either have the
return code "-EALREADY" or "-EISCONN", depending on whether the
connection has been established or not.
The user must have explicitly set the socket to be non-blocking
(SOCK_NONBLOCK or O_NONBLOCK, depending on method used), so unless
for some reason they had set this already (the socket would anyway
remain blocking in current TIPC) this change should be completely
backwards compatible.
It is also now possible to call select() or poll() to wait for the
completion of a connection.
An effect of the above is that the actual completion of a connection
may now be performed asynchronously, independent of the calls from
user space. Therefore, we now execute this code in BH context, in
the function filter_rcv(), which is executed upon reception of
messages in the socket.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
[PG: minor refactoring for improved connect/disconnect function names]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2012-11-30 06:51:19 +07:00
|
|
|
|
2017-10-20 16:21:32 +07:00
|
|
|
sk_reset_timer(sk, &sk->sk_timer, jiffies + CONN_PROBING_INTV);
|
2016-11-01 20:02:44 +07:00
|
|
|
tipc_set_sk_state(sk, TIPC_ESTABLISHED);
|
2015-01-09 14:27:05 +07:00
|
|
|
tipc_node_add_conn(net, peer_node, tsk->portid, peer_port);
|
tipc: improve throughput between nodes in netns
Currently, TIPC transports intra-node user data messages directly
socket to socket, hence shortcutting all the lower layers of the
communication stack. This gives TIPC very good intra node performance,
both regarding throughput and latency.
We now introduce a similar mechanism for TIPC data traffic across
network namespaces located in the same kernel. On the send path, the
call chain is as always accompanied by the sending node's network name
space pointer. However, once we have reliably established that the
receiving node is represented by a namespace on the same host, we just
replace the namespace pointer with the receiving node/namespace's
ditto, and follow the regular socket receive patch though the receiving
node. This technique gives us a throughput similar to the node internal
throughput, several times larger than if we let the traffic go though
the full network stacks. As a comparison, max throughput for 64k
messages is four times larger than TCP throughput for the same type of
traffic.
To meet any security concerns, the following should be noted.
- All nodes joining a cluster are supposed to have been be certified
and authenticated by mechanisms outside TIPC. This is no different for
nodes/namespaces on the same host; they have to auto discover each
other using the attached interfaces, and establish links which are
supervised via the regular link monitoring mechanism. Hence, a kernel
local node has no other way to join a cluster than any other node, and
have to obey to policies set in the IP or device layers of the stack.
- Only when a sender has established with 100% certainty that the peer
node is located in a kernel local namespace does it choose to let user
data messages, and only those, take the crossover path to the receiving
node/namespace.
- If the receiving node/namespace is removed, its namespace pointer
is invalidated at all peer nodes, and their neighbor link monitoring
will eventually note that this node is gone.
- To ensure the "100% certainty" criteria, and prevent any possible
spoofing, received discovery messages must contain a proof that the
sender knows a common secret. We use the hash mix of the sending
node/namespace for this purpose, since it can be accessed directly by
all other namespaces in the kernel. Upon reception of a discovery
message, the receiver checks this proof against all the local
namespaces'hash_mix:es. If it finds a match, that, along with a
matching node id and cluster id, this is deemed sufficient proof that
the peer node in question is in a local namespace, and a wormhole can
be opened.
- We should also consider that TIPC is intended to be a cluster local
IPC mechanism (just like e.g. UNIX sockets) rather than a network
protocol, and hence we think it can justified to allow it to shortcut the
lower protocol layers.
Regarding traceability, we should notice that since commit 6c9081a3915d
("tipc: add loopback device tracking") it is possible to follow the node
internal packet flow by just activating tcpdump on the loopback
interface. This will be true even for this mechanism; by activating
tcpdump on the involved nodes' loopback interfaces their inter-name
space messaging can easily be tracked.
v2:
- update 'net' pointer when node left/rejoined
v3:
- grab read/write lock when using node ref obj
v4:
- clone traffics between netns to loopback
Suggested-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29 07:51:21 +07:00
|
|
|
tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid, true);
|
2016-05-02 22:58:46 +07:00
|
|
|
tsk->peer_caps = tipc_node_get_capabilities(net, peer_node);
|
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 20:00:41 +07:00
|
|
|
tsk_set_nagle(tsk);
|
2018-09-29 01:23:22 +07:00
|
|
|
__skb_queue_purge(&sk->sk_write_queue);
|
tipc: redesign connection-level flow control
There are two flow control mechanisms in TIPC; one at link level that
handles network congestion, burst control, and retransmission, and one
at connection level which' only remaining task is to prevent overflow
in the receiving socket buffer. In TIPC, the latter task has to be
solved end-to-end because messages can not be thrown away once they
have been accepted and delivered upwards from the link layer, i.e, we
can never permit the receive buffer to overflow.
Currently, this algorithm is message based. A counter in the receiving
socket keeps track of number of consumed messages, and sends a dedicated
acknowledge message back to the sender for each 256 consumed message.
A counter at the sending end keeps track of the sent, not yet
acknowledged messages, and blocks the sender if this number ever reaches
512 unacknowledged messages. When the missing acknowledge arrives, the
socket is then woken up for renewed transmission. This works well for
keeping the message flow running, as it almost never happens that a
sender socket is blocked this way.
A problem with the current mechanism is that it potentially is very
memory consuming. Since we don't distinguish between small and large
messages, we have to dimension the socket receive buffer according
to a worst-case of both. I.e., the window size must be chosen large
enough to sustain a reasonable throughput even for the smallest
messages, while we must still consider a scenario where all messages
are of maximum size. Hence, the current fix window size of 512 messages
and a maximum message size of 66k results in a receive buffer of 66 MB
when truesize(66k) = 131k is taken into account. It is possible to do
much better.
This commit introduces an algorithm where we instead use 1024-byte
blocks as base unit. This unit, always rounded upwards from the
actual message size, is used when we advertise windows as well as when
we count and acknowledge transmitted data. The advertised window is
based on the configured receive buffer size in such a way that even
the worst-case truesize/msgsize ratio always is covered. Since the
smallest possible message size (from a flow control viewpoint) now is
1024 bytes, we can safely assume this ratio to be less than four, which
is the value we are now using.
This way, we have been able to reduce the default receive buffer size
from 66 MB to 2 MB with maintained performance.
In order to keep this solution backwards compatible, we introduce a
new capability bit in the discovery protocol, and use this throughout
the message sending/reception path to always select the right unit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 22:58:47 +07:00
|
|
|
if (tsk->peer_caps & TIPC_BLOCK_FLOWCTL)
|
|
|
|
return;
|
|
|
|
|
|
|
|
/* Fall back to message based flow control */
|
|
|
|
tsk->rcv_win = FLOWCTL_MSG_WIN;
|
|
|
|
tsk->snd_win = FLOWCTL_MSG_WIN;
|
2006-01-03 01:04:38 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
2017-10-13 16:04:24 +07:00
|
|
|
* tipc_sk_set_orig_addr - capture sender's address for received message
|
2006-01-03 01:04:38 +07:00
|
|
|
* @m: descriptor for message info
|
2020-07-13 06:15:14 +07:00
|
|
|
* @skb: received message
|
2007-02-09 21:25:21 +07:00
|
|
|
*
|
2006-01-03 01:04:38 +07:00
|
|
|
* Note: Address is not captured if not requested by receiver.
|
|
|
|
*/
|
2017-10-13 16:04:24 +07:00
|
|
|
static void tipc_sk_set_orig_addr(struct msghdr *m, struct sk_buff *skb)
|
2006-01-03 01:04:38 +07:00
|
|
|
{
|
2017-10-13 16:04:24 +07:00
|
|
|
DECLARE_SOCKADDR(struct sockaddr_pair *, srcaddr, m->msg_name);
|
|
|
|
struct tipc_msg *hdr = buf_msg(skb);
|
|
|
|
|
|
|
|
if (!srcaddr)
|
|
|
|
return;
|
|
|
|
|
|
|
|
srcaddr->sock.family = AF_TIPC;
|
|
|
|
srcaddr->sock.addrtype = TIPC_ADDR_ID;
|
2018-05-09 23:50:22 +07:00
|
|
|
srcaddr->sock.scope = 0;
|
2017-10-13 16:04:24 +07:00
|
|
|
srcaddr->sock.addr.id.ref = msg_origport(hdr);
|
|
|
|
srcaddr->sock.addr.id.node = msg_orignode(hdr);
|
|
|
|
srcaddr->sock.addr.name.domain = 0;
|
|
|
|
m->msg_namelen = sizeof(struct sockaddr_tipc);
|
|
|
|
|
|
|
|
if (!msg_in_group(hdr))
|
|
|
|
return;
|
|
|
|
|
|
|
|
/* Group message users may also want to know sending member's id */
|
|
|
|
srcaddr->member.family = AF_TIPC;
|
|
|
|
srcaddr->member.addrtype = TIPC_ADDR_NAME;
|
2018-05-09 23:50:22 +07:00
|
|
|
srcaddr->member.scope = 0;
|
2017-10-13 16:04:24 +07:00
|
|
|
srcaddr->member.addr.name.name.type = msg_nametype(hdr);
|
|
|
|
srcaddr->member.addr.name.name.instance = TIPC_SKB_CB(skb)->orig_member;
|
|
|
|
srcaddr->member.addr.name.domain = 0;
|
|
|
|
m->msg_namelen = sizeof(*srcaddr);
|
2006-01-03 01:04:38 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
2014-08-23 05:09:20 +07:00
|
|
|
* tipc_sk_anc_data_recv - optionally capture ancillary data for received message
|
2006-01-03 01:04:38 +07:00
|
|
|
* @m: descriptor for message info
|
2018-11-18 00:17:06 +07:00
|
|
|
* @skb: received message buffer
|
2014-08-23 05:09:20 +07:00
|
|
|
* @tsk: TIPC port associated with message
|
2007-02-09 21:25:21 +07:00
|
|
|
*
|
2006-01-03 01:04:38 +07:00
|
|
|
* Note: Ancillary data is not captured if not requested by receiver.
|
2007-02-09 21:25:21 +07:00
|
|
|
*
|
2006-01-03 01:04:38 +07:00
|
|
|
* Returns 0 if successful, otherwise errno
|
|
|
|
*/
|
2018-11-18 00:17:06 +07:00
|
|
|
static int tipc_sk_anc_data_recv(struct msghdr *m, struct sk_buff *skb,
|
2014-08-23 05:09:20 +07:00
|
|
|
struct tipc_sock *tsk)
|
2006-01-03 01:04:38 +07:00
|
|
|
{
|
2018-11-18 00:17:06 +07:00
|
|
|
struct tipc_msg *msg;
|
2006-01-03 01:04:38 +07:00
|
|
|
u32 anc_data[3];
|
|
|
|
u32 err;
|
|
|
|
u32 dest_type;
|
2006-06-26 13:45:24 +07:00
|
|
|
int has_name;
|
2006-01-03 01:04:38 +07:00
|
|
|
int res;
|
|
|
|
|
|
|
|
if (likely(m->msg_controllen == 0))
|
|
|
|
return 0;
|
2018-11-18 00:17:06 +07:00
|
|
|
msg = buf_msg(skb);
|
2006-01-03 01:04:38 +07:00
|
|
|
|
|
|
|
/* Optionally capture errored message object(s) */
|
|
|
|
err = msg ? msg_errcode(msg) : 0;
|
|
|
|
if (unlikely(err)) {
|
|
|
|
anc_data[0] = err;
|
|
|
|
anc_data[1] = msg_data_sz(msg);
|
2011-01-01 01:59:33 +07:00
|
|
|
res = put_cmsg(m, SOL_TIPC, TIPC_ERRINFO, 8, anc_data);
|
|
|
|
if (res)
|
2006-01-03 01:04:38 +07:00
|
|
|
return res;
|
2011-01-01 01:59:33 +07:00
|
|
|
if (anc_data[1]) {
|
2018-11-18 00:17:06 +07:00
|
|
|
if (skb_linearize(skb))
|
|
|
|
return -ENOMEM;
|
|
|
|
msg = buf_msg(skb);
|
2011-01-01 01:59:33 +07:00
|
|
|
res = put_cmsg(m, SOL_TIPC, TIPC_RETDATA, anc_data[1],
|
|
|
|
msg_data(msg));
|
|
|
|
if (res)
|
|
|
|
return res;
|
|
|
|
}
|
2006-01-03 01:04:38 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Optionally capture message destination object */
|
|
|
|
dest_type = msg ? msg_type(msg) : TIPC_DIRECT_MSG;
|
|
|
|
switch (dest_type) {
|
|
|
|
case TIPC_NAMED_MSG:
|
2006-06-26 13:45:24 +07:00
|
|
|
has_name = 1;
|
2006-01-03 01:04:38 +07:00
|
|
|
anc_data[0] = msg_nametype(msg);
|
|
|
|
anc_data[1] = msg_namelower(msg);
|
|
|
|
anc_data[2] = msg_namelower(msg);
|
|
|
|
break;
|
|
|
|
case TIPC_MCAST_MSG:
|
2006-06-26 13:45:24 +07:00
|
|
|
has_name = 1;
|
2006-01-03 01:04:38 +07:00
|
|
|
anc_data[0] = msg_nametype(msg);
|
|
|
|
anc_data[1] = msg_namelower(msg);
|
|
|
|
anc_data[2] = msg_nameupper(msg);
|
|
|
|
break;
|
|
|
|
case TIPC_CONN_MSG:
|
2014-08-23 05:09:20 +07:00
|
|
|
has_name = (tsk->conn_type != 0);
|
|
|
|
anc_data[0] = tsk->conn_type;
|
|
|
|
anc_data[1] = tsk->conn_instance;
|
|
|
|
anc_data[2] = tsk->conn_instance;
|
2006-01-03 01:04:38 +07:00
|
|
|
break;
|
|
|
|
default:
|
2006-06-26 13:45:24 +07:00
|
|
|
has_name = 0;
|
2006-01-03 01:04:38 +07:00
|
|
|
}
|
2011-01-01 01:59:33 +07:00
|
|
|
if (has_name) {
|
|
|
|
res = put_cmsg(m, SOL_TIPC, TIPC_DESTNAME, 12, anc_data);
|
|
|
|
if (res)
|
|
|
|
return res;
|
|
|
|
}
|
2006-01-03 01:04:38 +07:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2020-05-13 19:33:16 +07:00
|
|
|
static struct sk_buff *tipc_sk_build_ack(struct tipc_sock *tsk)
|
2014-08-23 05:09:12 +07:00
|
|
|
{
|
2016-11-01 20:02:40 +07:00
|
|
|
struct sock *sk = &tsk->sk;
|
2014-11-26 10:41:55 +07:00
|
|
|
struct sk_buff *skb = NULL;
|
2014-08-23 05:09:12 +07:00
|
|
|
struct tipc_msg *msg;
|
2014-08-23 05:09:20 +07:00
|
|
|
u32 peer_port = tsk_peer_port(tsk);
|
|
|
|
u32 dnode = tsk_peer_node(tsk);
|
2014-08-23 05:09:12 +07:00
|
|
|
|
2016-11-01 20:02:40 +07:00
|
|
|
if (!tipc_sk_connected(sk))
|
2020-05-13 19:33:16 +07:00
|
|
|
return NULL;
|
2015-02-05 20:36:36 +07:00
|
|
|
skb = tipc_msg_create(CONN_MANAGER, CONN_ACK, INT_H_SIZE, 0,
|
|
|
|
dnode, tsk_own_node(tsk), peer_port,
|
|
|
|
tsk->portid, TIPC_OK);
|
2014-11-26 10:41:55 +07:00
|
|
|
if (!skb)
|
2020-05-13 19:33:16 +07:00
|
|
|
return NULL;
|
2014-11-26 10:41:55 +07:00
|
|
|
msg = buf_msg(skb);
|
tipc: redesign connection-level flow control
There are two flow control mechanisms in TIPC; one at link level that
handles network congestion, burst control, and retransmission, and one
at connection level which' only remaining task is to prevent overflow
in the receiving socket buffer. In TIPC, the latter task has to be
solved end-to-end because messages can not be thrown away once they
have been accepted and delivered upwards from the link layer, i.e, we
can never permit the receive buffer to overflow.
Currently, this algorithm is message based. A counter in the receiving
socket keeps track of number of consumed messages, and sends a dedicated
acknowledge message back to the sender for each 256 consumed message.
A counter at the sending end keeps track of the sent, not yet
acknowledged messages, and blocks the sender if this number ever reaches
512 unacknowledged messages. When the missing acknowledge arrives, the
socket is then woken up for renewed transmission. This works well for
keeping the message flow running, as it almost never happens that a
sender socket is blocked this way.
A problem with the current mechanism is that it potentially is very
memory consuming. Since we don't distinguish between small and large
messages, we have to dimension the socket receive buffer according
to a worst-case of both. I.e., the window size must be chosen large
enough to sustain a reasonable throughput even for the smallest
messages, while we must still consider a scenario where all messages
are of maximum size. Hence, the current fix window size of 512 messages
and a maximum message size of 66k results in a receive buffer of 66 MB
when truesize(66k) = 131k is taken into account. It is possible to do
much better.
This commit introduces an algorithm where we instead use 1024-byte
blocks as base unit. This unit, always rounded upwards from the
actual message size, is used when we advertise windows as well as when
we count and acknowledge transmitted data. The advertised window is
based on the configured receive buffer size in such a way that even
the worst-case truesize/msgsize ratio always is covered. Since the
smallest possible message size (from a flow control viewpoint) now is
1024 bytes, we can safely assume this ratio to be less than four, which
is the value we are now using.
This way, we have been able to reduce the default receive buffer size
from 66 MB to 2 MB with maintained performance.
In order to keep this solution backwards compatible, we introduce a
new capability bit in the discovery protocol, and use this throughout
the message sending/reception path to always select the right unit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 22:58:47 +07:00
|
|
|
msg_set_conn_ack(msg, tsk->rcv_unacked);
|
|
|
|
tsk->rcv_unacked = 0;
|
|
|
|
|
|
|
|
/* Adjust to and advertize the correct window limit */
|
|
|
|
if (tsk->peer_caps & TIPC_BLOCK_FLOWCTL) {
|
|
|
|
tsk->rcv_win = tsk_adv_blocks(tsk->sk.sk_rcvbuf);
|
|
|
|
msg_set_adv_win(msg, tsk->rcv_win);
|
|
|
|
}
|
2020-05-13 19:33:16 +07:00
|
|
|
return skb;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void tipc_sk_send_ack(struct tipc_sock *tsk)
|
|
|
|
{
|
|
|
|
struct sk_buff *skb;
|
|
|
|
|
|
|
|
skb = tipc_sk_build_ack(tsk);
|
|
|
|
if (!skb)
|
|
|
|
return;
|
|
|
|
|
|
|
|
tipc_node_xmit_skb(sock_net(&tsk->sk), skb, tsk_peer_node(tsk),
|
|
|
|
msg_link_selector(buf_msg(skb)));
|
2014-08-23 05:09:12 +07:00
|
|
|
}
|
|
|
|
|
2014-05-24 02:55:12 +07:00
|
|
|
static int tipc_wait_for_rcvmsg(struct socket *sock, long *timeop)
|
2014-01-17 08:50:07 +07:00
|
|
|
{
|
|
|
|
struct sock *sk = sock->sk;
|
2019-02-19 11:20:48 +07:00
|
|
|
DEFINE_WAIT_FUNC(wait, woken_wake_function);
|
2014-05-24 02:55:12 +07:00
|
|
|
long timeo = *timeop;
|
2017-04-26 15:05:01 +07:00
|
|
|
int err = sock_error(sk);
|
|
|
|
|
|
|
|
if (err)
|
|
|
|
return err;
|
2014-01-17 08:50:07 +07:00
|
|
|
|
|
|
|
for (;;) {
|
2014-03-06 20:40:18 +07:00
|
|
|
if (timeo && skb_queue_empty(&sk->sk_receive_queue)) {
|
2016-11-01 20:02:47 +07:00
|
|
|
if (sk->sk_shutdown & RCV_SHUTDOWN) {
|
2014-01-17 08:50:07 +07:00
|
|
|
err = -ENOTCONN;
|
|
|
|
break;
|
|
|
|
}
|
2019-02-19 11:20:48 +07:00
|
|
|
add_wait_queue(sk_sleep(sk), &wait);
|
2014-01-17 08:50:07 +07:00
|
|
|
release_sock(sk);
|
2019-02-19 11:20:48 +07:00
|
|
|
timeo = wait_woken(&wait, TASK_INTERRUPTIBLE, timeo);
|
|
|
|
sched_annotate_sleep();
|
2014-01-17 08:50:07 +07:00
|
|
|
lock_sock(sk);
|
2019-02-19 11:20:48 +07:00
|
|
|
remove_wait_queue(sk_sleep(sk), &wait);
|
2014-01-17 08:50:07 +07:00
|
|
|
}
|
|
|
|
err = 0;
|
|
|
|
if (!skb_queue_empty(&sk->sk_receive_queue))
|
|
|
|
break;
|
|
|
|
err = -EAGAIN;
|
|
|
|
if (!timeo)
|
|
|
|
break;
|
2015-03-09 16:43:42 +07:00
|
|
|
err = sock_intr_errno(timeo);
|
|
|
|
if (signal_pending(current))
|
|
|
|
break;
|
2017-04-26 15:05:01 +07:00
|
|
|
|
|
|
|
err = sock_error(sk);
|
|
|
|
if (err)
|
|
|
|
break;
|
2014-01-17 08:50:07 +07:00
|
|
|
}
|
2014-05-24 02:55:12 +07:00
|
|
|
*timeop = timeo;
|
2014-01-17 08:50:07 +07:00
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2007-02-09 21:25:21 +07:00
|
|
|
/**
|
2014-02-18 15:06:46 +07:00
|
|
|
* tipc_recvmsg - receive packet-oriented message
|
2006-01-03 01:04:38 +07:00
|
|
|
* @m: descriptor for message info
|
2017-05-02 23:16:53 +07:00
|
|
|
* @buflen: length of user buffer area
|
2006-01-03 01:04:38 +07:00
|
|
|
* @flags: receive flags
|
2007-02-09 21:25:21 +07:00
|
|
|
*
|
2006-01-03 01:04:38 +07:00
|
|
|
* Used for SOCK_DGRAM, SOCK_RDM, and SOCK_SEQPACKET messages.
|
|
|
|
* If the complete message doesn't fit in user area, truncate it.
|
|
|
|
*
|
|
|
|
* Returns size of returned message data, errno otherwise
|
|
|
|
*/
|
2017-05-02 23:16:53 +07:00
|
|
|
static int tipc_recvmsg(struct socket *sock, struct msghdr *m,
|
|
|
|
size_t buflen, int flags)
|
2006-01-03 01:04:38 +07:00
|
|
|
{
|
2008-04-15 14:22:02 +07:00
|
|
|
struct sock *sk = sock->sk;
|
2017-05-02 23:16:53 +07:00
|
|
|
bool connected = !tipc_sk_type_connectionless(sk);
|
2017-10-13 16:04:25 +07:00
|
|
|
struct tipc_sock *tsk = tipc_sk(sk);
|
2017-05-02 23:16:53 +07:00
|
|
|
int rc, err, hlen, dlen, copy;
|
2017-10-13 16:04:26 +07:00
|
|
|
struct sk_buff_head xmitq;
|
2017-10-13 16:04:25 +07:00
|
|
|
struct tipc_msg *hdr;
|
|
|
|
struct sk_buff *skb;
|
|
|
|
bool grp_evt;
|
2017-05-02 23:16:53 +07:00
|
|
|
long timeout;
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2008-04-15 14:22:02 +07:00
|
|
|
/* Catch invalid receive requests */
|
2017-05-02 23:16:53 +07:00
|
|
|
if (unlikely(!buflen))
|
2006-01-03 01:04:38 +07:00
|
|
|
return -EINVAL;
|
|
|
|
|
2008-04-15 14:22:02 +07:00
|
|
|
lock_sock(sk);
|
2017-05-02 23:16:53 +07:00
|
|
|
if (unlikely(connected && sk->sk_state == TIPC_OPEN)) {
|
|
|
|
rc = -ENOTCONN;
|
2006-01-03 01:04:38 +07:00
|
|
|
goto exit;
|
|
|
|
}
|
2017-05-02 23:16:53 +07:00
|
|
|
timeout = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2017-10-13 16:04:26 +07:00
|
|
|
/* Step rcv queue to first msg with data or error; wait if necessary */
|
2017-05-02 23:16:53 +07:00
|
|
|
do {
|
|
|
|
rc = tipc_wait_for_rcvmsg(sock, &timeout);
|
|
|
|
if (unlikely(rc))
|
|
|
|
goto exit;
|
|
|
|
skb = skb_peek(&sk->sk_receive_queue);
|
|
|
|
hdr = buf_msg(skb);
|
|
|
|
dlen = msg_data_sz(hdr);
|
|
|
|
hlen = msg_hdr_sz(hdr);
|
|
|
|
err = msg_errcode(hdr);
|
2017-10-13 16:04:25 +07:00
|
|
|
grp_evt = msg_is_grp_evt(hdr);
|
2017-05-02 23:16:53 +07:00
|
|
|
if (likely(dlen || err))
|
|
|
|
break;
|
2014-08-23 05:09:18 +07:00
|
|
|
tsk_advance_rx_queue(sk);
|
2017-05-02 23:16:53 +07:00
|
|
|
} while (1);
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2017-05-02 23:16:53 +07:00
|
|
|
/* Collect msg meta data, including error code and rejected data */
|
2017-10-13 16:04:24 +07:00
|
|
|
tipc_sk_set_orig_addr(m, skb);
|
2018-11-18 00:17:06 +07:00
|
|
|
rc = tipc_sk_anc_data_recv(m, skb, tsk);
|
2017-05-02 23:16:53 +07:00
|
|
|
if (unlikely(rc))
|
2006-01-03 01:04:38 +07:00
|
|
|
goto exit;
|
2018-11-18 00:17:06 +07:00
|
|
|
hdr = buf_msg(skb);
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2017-05-02 23:16:53 +07:00
|
|
|
/* Capture data if non-error msg, otherwise just set return value */
|
|
|
|
if (likely(!err)) {
|
|
|
|
copy = min_t(int, dlen, buflen);
|
|
|
|
if (unlikely(copy != dlen))
|
2006-01-03 01:04:38 +07:00
|
|
|
m->msg_flags |= MSG_TRUNC;
|
2017-05-02 23:16:53 +07:00
|
|
|
rc = skb_copy_datagram_msg(skb, hlen, m, copy);
|
2006-01-03 01:04:38 +07:00
|
|
|
} else {
|
2017-05-02 23:16:53 +07:00
|
|
|
copy = 0;
|
|
|
|
rc = 0;
|
|
|
|
if (err != TIPC_CONN_SHUTDOWN && connected && !m->msg_control)
|
|
|
|
rc = -ECONNRESET;
|
2006-01-03 01:04:38 +07:00
|
|
|
}
|
2017-05-02 23:16:53 +07:00
|
|
|
if (unlikely(rc))
|
|
|
|
goto exit;
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2017-10-13 16:04:25 +07:00
|
|
|
/* Mark message as group event if applicable */
|
|
|
|
if (unlikely(grp_evt)) {
|
|
|
|
if (msg_grp_evt(hdr) == TIPC_WITHDRAWN)
|
|
|
|
m->msg_flags |= MSG_EOR;
|
|
|
|
m->msg_flags |= MSG_OOB;
|
|
|
|
copy = 0;
|
|
|
|
}
|
|
|
|
|
2017-05-02 23:16:53 +07:00
|
|
|
/* Caption of data or error code/rejected data was successful */
|
tipc: redesign connection-level flow control
There are two flow control mechanisms in TIPC; one at link level that
handles network congestion, burst control, and retransmission, and one
at connection level which' only remaining task is to prevent overflow
in the receiving socket buffer. In TIPC, the latter task has to be
solved end-to-end because messages can not be thrown away once they
have been accepted and delivered upwards from the link layer, i.e, we
can never permit the receive buffer to overflow.
Currently, this algorithm is message based. A counter in the receiving
socket keeps track of number of consumed messages, and sends a dedicated
acknowledge message back to the sender for each 256 consumed message.
A counter at the sending end keeps track of the sent, not yet
acknowledged messages, and blocks the sender if this number ever reaches
512 unacknowledged messages. When the missing acknowledge arrives, the
socket is then woken up for renewed transmission. This works well for
keeping the message flow running, as it almost never happens that a
sender socket is blocked this way.
A problem with the current mechanism is that it potentially is very
memory consuming. Since we don't distinguish between small and large
messages, we have to dimension the socket receive buffer according
to a worst-case of both. I.e., the window size must be chosen large
enough to sustain a reasonable throughput even for the smallest
messages, while we must still consider a scenario where all messages
are of maximum size. Hence, the current fix window size of 512 messages
and a maximum message size of 66k results in a receive buffer of 66 MB
when truesize(66k) = 131k is taken into account. It is possible to do
much better.
This commit introduces an algorithm where we instead use 1024-byte
blocks as base unit. This unit, always rounded upwards from the
actual message size, is used when we advertise windows as well as when
we count and acknowledge transmitted data. The advertised window is
based on the configured receive buffer size in such a way that even
the worst-case truesize/msgsize ratio always is covered. Since the
smallest possible message size (from a flow control viewpoint) now is
1024 bytes, we can safely assume this ratio to be less than four, which
is the value we are now using.
This way, we have been able to reduce the default receive buffer size
from 66 MB to 2 MB with maintained performance.
In order to keep this solution backwards compatible, we introduce a
new capability bit in the discovery protocol, and use this throughout
the message sending/reception path to always select the right unit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 22:58:47 +07:00
|
|
|
if (unlikely(flags & MSG_PEEK))
|
|
|
|
goto exit;
|
|
|
|
|
2017-10-13 16:04:26 +07:00
|
|
|
/* Send group flow control advertisement when applicable */
|
|
|
|
if (tsk->group && msg_in_group(hdr) && !grp_evt) {
|
2019-08-15 21:42:50 +07:00
|
|
|
__skb_queue_head_init(&xmitq);
|
2017-10-13 16:04:26 +07:00
|
|
|
tipc_group_update_rcv_win(tsk->group, tsk_blocks(hlen + dlen),
|
|
|
|
msg_orignode(hdr), msg_origport(hdr),
|
|
|
|
&xmitq);
|
|
|
|
tipc_node_distr_xmit(sock_net(sk), &xmitq);
|
|
|
|
}
|
|
|
|
|
tipc: redesign connection-level flow control
There are two flow control mechanisms in TIPC; one at link level that
handles network congestion, burst control, and retransmission, and one
at connection level which' only remaining task is to prevent overflow
in the receiving socket buffer. In TIPC, the latter task has to be
solved end-to-end because messages can not be thrown away once they
have been accepted and delivered upwards from the link layer, i.e, we
can never permit the receive buffer to overflow.
Currently, this algorithm is message based. A counter in the receiving
socket keeps track of number of consumed messages, and sends a dedicated
acknowledge message back to the sender for each 256 consumed message.
A counter at the sending end keeps track of the sent, not yet
acknowledged messages, and blocks the sender if this number ever reaches
512 unacknowledged messages. When the missing acknowledge arrives, the
socket is then woken up for renewed transmission. This works well for
keeping the message flow running, as it almost never happens that a
sender socket is blocked this way.
A problem with the current mechanism is that it potentially is very
memory consuming. Since we don't distinguish between small and large
messages, we have to dimension the socket receive buffer according
to a worst-case of both. I.e., the window size must be chosen large
enough to sustain a reasonable throughput even for the smallest
messages, while we must still consider a scenario where all messages
are of maximum size. Hence, the current fix window size of 512 messages
and a maximum message size of 66k results in a receive buffer of 66 MB
when truesize(66k) = 131k is taken into account. It is possible to do
much better.
This commit introduces an algorithm where we instead use 1024-byte
blocks as base unit. This unit, always rounded upwards from the
actual message size, is used when we advertise windows as well as when
we count and acknowledge transmitted data. The advertised window is
based on the configured receive buffer size in such a way that even
the worst-case truesize/msgsize ratio always is covered. Since the
smallest possible message size (from a flow control viewpoint) now is
1024 bytes, we can safely assume this ratio to be less than four, which
is the value we are now using.
This way, we have been able to reduce the default receive buffer size
from 66 MB to 2 MB with maintained performance.
In order to keep this solution backwards compatible, we introduce a
new capability bit in the discovery protocol, and use this throughout
the message sending/reception path to always select the right unit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 22:58:47 +07:00
|
|
|
tsk_advance_rx_queue(sk);
|
2017-10-13 16:04:25 +07:00
|
|
|
|
2017-05-02 23:16:53 +07:00
|
|
|
if (likely(!connected))
|
|
|
|
goto exit;
|
|
|
|
|
2017-10-13 16:04:26 +07:00
|
|
|
/* Send connection flow control advertisement when applicable */
|
2017-05-02 23:16:53 +07:00
|
|
|
tsk->rcv_unacked += tsk_inc(tsk, hlen + dlen);
|
|
|
|
if (tsk->rcv_unacked >= tsk->rcv_win / TIPC_ACK_RATE)
|
|
|
|
tipc_sk_send_ack(tsk);
|
2006-01-03 01:04:38 +07:00
|
|
|
exit:
|
2008-04-15 14:22:02 +07:00
|
|
|
release_sock(sk);
|
2017-05-02 23:16:53 +07:00
|
|
|
return rc ? rc : copy;
|
2006-01-03 01:04:38 +07:00
|
|
|
}
|
|
|
|
|
2007-02-09 21:25:21 +07:00
|
|
|
/**
|
2017-05-02 23:16:54 +07:00
|
|
|
* tipc_recvstream - receive stream-oriented data
|
2006-01-03 01:04:38 +07:00
|
|
|
* @m: descriptor for message info
|
2017-05-02 23:16:54 +07:00
|
|
|
* @buflen: total size of user buffer area
|
2006-01-03 01:04:38 +07:00
|
|
|
* @flags: receive flags
|
2007-02-09 21:25:21 +07:00
|
|
|
*
|
|
|
|
* Used for SOCK_STREAM messages only. If not enough data is available
|
2006-01-03 01:04:38 +07:00
|
|
|
* will optionally wait for more; never truncates data.
|
|
|
|
*
|
|
|
|
* Returns size of returned message data, errno otherwise
|
|
|
|
*/
|
2017-05-02 23:16:54 +07:00
|
|
|
static int tipc_recvstream(struct socket *sock, struct msghdr *m,
|
|
|
|
size_t buflen, int flags)
|
2006-01-03 01:04:38 +07:00
|
|
|
{
|
2008-04-15 14:22:02 +07:00
|
|
|
struct sock *sk = sock->sk;
|
2014-03-12 22:31:12 +07:00
|
|
|
struct tipc_sock *tsk = tipc_sk(sk);
|
2017-05-02 23:16:54 +07:00
|
|
|
struct sk_buff *skb;
|
|
|
|
struct tipc_msg *hdr;
|
|
|
|
struct tipc_skb_cb *skb_cb;
|
|
|
|
bool peek = flags & MSG_PEEK;
|
|
|
|
int offset, required, copy, copied = 0;
|
|
|
|
int hlen, dlen, err, rc;
|
|
|
|
long timeout;
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2008-04-15 14:22:02 +07:00
|
|
|
/* Catch invalid receive attempts */
|
2017-05-02 23:16:54 +07:00
|
|
|
if (unlikely(!buflen))
|
2006-01-03 01:04:38 +07:00
|
|
|
return -EINVAL;
|
|
|
|
|
2008-04-15 14:22:02 +07:00
|
|
|
lock_sock(sk);
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2016-11-01 20:02:45 +07:00
|
|
|
if (unlikely(sk->sk_state == TIPC_OPEN)) {
|
2017-05-02 23:16:54 +07:00
|
|
|
rc = -ENOTCONN;
|
2014-01-17 08:50:07 +07:00
|
|
|
goto exit;
|
2006-01-03 01:04:38 +07:00
|
|
|
}
|
2017-05-02 23:16:54 +07:00
|
|
|
required = sock_rcvlowat(sk, flags & MSG_WAITALL, buflen);
|
|
|
|
timeout = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2017-05-02 23:16:54 +07:00
|
|
|
do {
|
|
|
|
/* Look at first msg in receive queue; wait if necessary */
|
|
|
|
rc = tipc_wait_for_rcvmsg(sock, &timeout);
|
|
|
|
if (unlikely(rc))
|
|
|
|
break;
|
|
|
|
skb = skb_peek(&sk->sk_receive_queue);
|
|
|
|
skb_cb = TIPC_SKB_CB(skb);
|
|
|
|
hdr = buf_msg(skb);
|
|
|
|
dlen = msg_data_sz(hdr);
|
|
|
|
hlen = msg_hdr_sz(hdr);
|
|
|
|
err = msg_errcode(hdr);
|
2011-02-21 21:45:40 +07:00
|
|
|
|
2017-05-02 23:16:54 +07:00
|
|
|
/* Discard any empty non-errored (SYN-) message */
|
|
|
|
if (unlikely(!dlen && !err)) {
|
|
|
|
tsk_advance_rx_queue(sk);
|
|
|
|
continue;
|
|
|
|
}
|
2011-02-21 21:45:40 +07:00
|
|
|
|
2017-05-02 23:16:54 +07:00
|
|
|
/* Collect msg meta data, incl. error code and rejected data */
|
|
|
|
if (!copied) {
|
2017-10-13 16:04:24 +07:00
|
|
|
tipc_sk_set_orig_addr(m, skb);
|
2018-11-18 00:17:06 +07:00
|
|
|
rc = tipc_sk_anc_data_recv(m, skb, tsk);
|
2017-05-02 23:16:54 +07:00
|
|
|
if (rc)
|
|
|
|
break;
|
2018-11-18 00:17:06 +07:00
|
|
|
hdr = buf_msg(skb);
|
2017-05-02 23:16:54 +07:00
|
|
|
}
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2017-05-02 23:16:54 +07:00
|
|
|
/* Copy data if msg ok, otherwise return error/partial data */
|
|
|
|
if (likely(!err)) {
|
|
|
|
offset = skb_cb->bytes_read;
|
|
|
|
copy = min_t(int, dlen - offset, buflen - copied);
|
|
|
|
rc = skb_copy_datagram_msg(skb, hlen + offset, m, copy);
|
|
|
|
if (unlikely(rc))
|
|
|
|
break;
|
|
|
|
copied += copy;
|
|
|
|
offset += copy;
|
|
|
|
if (unlikely(offset < dlen)) {
|
|
|
|
if (!peek)
|
|
|
|
skb_cb->bytes_read = offset;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
rc = 0;
|
|
|
|
if ((err != TIPC_CONN_SHUTDOWN) && !m->msg_control)
|
|
|
|
rc = -ECONNRESET;
|
|
|
|
if (copied || rc)
|
|
|
|
break;
|
2006-01-03 01:04:38 +07:00
|
|
|
}
|
|
|
|
|
2017-05-02 23:16:54 +07:00
|
|
|
if (unlikely(peek))
|
|
|
|
break;
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2017-05-02 23:16:54 +07:00
|
|
|
tsk_advance_rx_queue(sk);
|
tipc: redesign connection-level flow control
There are two flow control mechanisms in TIPC; one at link level that
handles network congestion, burst control, and retransmission, and one
at connection level which' only remaining task is to prevent overflow
in the receiving socket buffer. In TIPC, the latter task has to be
solved end-to-end because messages can not be thrown away once they
have been accepted and delivered upwards from the link layer, i.e, we
can never permit the receive buffer to overflow.
Currently, this algorithm is message based. A counter in the receiving
socket keeps track of number of consumed messages, and sends a dedicated
acknowledge message back to the sender for each 256 consumed message.
A counter at the sending end keeps track of the sent, not yet
acknowledged messages, and blocks the sender if this number ever reaches
512 unacknowledged messages. When the missing acknowledge arrives, the
socket is then woken up for renewed transmission. This works well for
keeping the message flow running, as it almost never happens that a
sender socket is blocked this way.
A problem with the current mechanism is that it potentially is very
memory consuming. Since we don't distinguish between small and large
messages, we have to dimension the socket receive buffer according
to a worst-case of both. I.e., the window size must be chosen large
enough to sustain a reasonable throughput even for the smallest
messages, while we must still consider a scenario where all messages
are of maximum size. Hence, the current fix window size of 512 messages
and a maximum message size of 66k results in a receive buffer of 66 MB
when truesize(66k) = 131k is taken into account. It is possible to do
much better.
This commit introduces an algorithm where we instead use 1024-byte
blocks as base unit. This unit, always rounded upwards from the
actual message size, is used when we advertise windows as well as when
we count and acknowledge transmitted data. The advertised window is
based on the configured receive buffer size in such a way that even
the worst-case truesize/msgsize ratio always is covered. Since the
smallest possible message size (from a flow control viewpoint) now is
1024 bytes, we can safely assume this ratio to be less than four, which
is the value we are now using.
This way, we have been able to reduce the default receive buffer size
from 66 MB to 2 MB with maintained performance.
In order to keep this solution backwards compatible, we introduce a
new capability bit in the discovery protocol, and use this throughout
the message sending/reception path to always select the right unit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 22:58:47 +07:00
|
|
|
|
2017-05-02 23:16:54 +07:00
|
|
|
/* Send connection flow control advertisement when applicable */
|
|
|
|
tsk->rcv_unacked += tsk_inc(tsk, hlen + dlen);
|
2020-05-13 19:33:16 +07:00
|
|
|
if (tsk->rcv_unacked >= tsk->rcv_win / TIPC_ACK_RATE)
|
2017-05-02 23:16:54 +07:00
|
|
|
tipc_sk_send_ack(tsk);
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2017-05-02 23:16:54 +07:00
|
|
|
/* Exit if all requested data or FIN/error received */
|
|
|
|
if (copied == buflen || err)
|
|
|
|
break;
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2017-05-02 23:16:54 +07:00
|
|
|
} while (!skb_queue_empty(&sk->sk_receive_queue) || copied < required);
|
2006-01-03 01:04:38 +07:00
|
|
|
exit:
|
2008-04-15 14:22:02 +07:00
|
|
|
release_sock(sk);
|
2017-05-02 23:16:54 +07:00
|
|
|
return copied ? copied : rc;
|
2006-01-03 01:04:38 +07:00
|
|
|
}
|
|
|
|
|
2012-08-21 10:16:57 +07:00
|
|
|
/**
|
|
|
|
* tipc_write_space - wake up thread if port congestion is released
|
|
|
|
* @sk: socket
|
|
|
|
*/
|
|
|
|
static void tipc_write_space(struct sock *sk)
|
|
|
|
{
|
|
|
|
struct socket_wq *wq;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
wq = rcu_dereference(sk->sk_wq);
|
2015-11-26 12:55:39 +07:00
|
|
|
if (skwq_has_sleeper(wq))
|
2018-02-12 05:34:03 +07:00
|
|
|
wake_up_interruptible_sync_poll(&wq->wait, EPOLLOUT |
|
|
|
|
EPOLLWRNORM | EPOLLWRBAND);
|
2012-08-21 10:16:57 +07:00
|
|
|
rcu_read_unlock();
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* tipc_data_ready - wake up threads to indicate messages have been received
|
|
|
|
* @sk: socket
|
|
|
|
*/
|
2014-04-12 03:15:36 +07:00
|
|
|
static void tipc_data_ready(struct sock *sk)
|
2012-08-21 10:16:57 +07:00
|
|
|
{
|
|
|
|
struct socket_wq *wq;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
wq = rcu_dereference(sk->sk_wq);
|
2015-11-26 12:55:39 +07:00
|
|
|
if (skwq_has_sleeper(wq))
|
2018-02-12 05:34:03 +07:00
|
|
|
wake_up_interruptible_sync_poll(&wq->wait, EPOLLIN |
|
|
|
|
EPOLLRDNORM | EPOLLRDBAND);
|
2012-08-21 10:16:57 +07:00
|
|
|
rcu_read_unlock();
|
|
|
|
}
|
|
|
|
|
2015-11-22 14:46:05 +07:00
|
|
|
static void tipc_sock_destruct(struct sock *sk)
|
|
|
|
{
|
|
|
|
__skb_queue_purge(&sk->sk_receive_queue);
|
|
|
|
}
|
|
|
|
|
2017-10-13 16:04:20 +07:00
|
|
|
static void tipc_sk_proto_rcv(struct sock *sk,
|
|
|
|
struct sk_buff_head *inputq,
|
|
|
|
struct sk_buff_head *xmitq)
|
|
|
|
{
|
|
|
|
struct sk_buff *skb = __skb_dequeue(inputq);
|
|
|
|
struct tipc_sock *tsk = tipc_sk(sk);
|
|
|
|
struct tipc_msg *hdr = buf_msg(skb);
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
struct tipc_group *grp = tsk->group;
|
2017-10-13 16:04:26 +07:00
|
|
|
bool wakeup = false;
|
2017-10-13 16:04:20 +07:00
|
|
|
|
|
|
|
switch (msg_user(hdr)) {
|
|
|
|
case CONN_MANAGER:
|
2018-10-10 22:50:23 +07:00
|
|
|
tipc_sk_conn_proto_rcv(tsk, skb, inputq, xmitq);
|
2017-10-13 16:04:20 +07:00
|
|
|
return;
|
|
|
|
case SOCK_WAKEUP:
|
2017-10-13 16:04:22 +07:00
|
|
|
tipc_dest_del(&tsk->cong_links, msg_orignode(hdr), 0);
|
2019-02-25 10:57:20 +07:00
|
|
|
/* coupled with smp_rmb() in tipc_wait_for_cond() */
|
|
|
|
smp_wmb();
|
2017-10-13 16:04:20 +07:00
|
|
|
tsk->cong_link_cnt--;
|
2017-10-13 16:04:26 +07:00
|
|
|
wakeup = true;
|
tipc: add test for Nagle algorithm effectiveness
When streaming in Nagle mode, we try to bundle small messages from user
as many as possible if there is one outstanding buffer, i.e. not ACK-ed
by the receiving side, which helps boost up the overall throughput. So,
the algorithm's effectiveness really depends on when Nagle ACK comes or
what the specific network latency (RTT) is, compared to the user's
message sending rate.
In a bad case, the user's sending rate is low or the network latency is
small, there will not be many bundles, so making a Nagle ACK or waiting
for it is not meaningful.
For example: a user sends its messages every 100ms and the RTT is 50ms,
then for each messages, we require one Nagle ACK but then there is only
one user message sent without any bundles.
In a better case, even if we have a few bundles (e.g. the RTT = 300ms),
but now the user sends messages in medium size, then there will not be
any difference at all, that says 3 x 1000-byte data messages if bundled
will still result in 3 bundles with MTU = 1500.
When Nagle is ineffective, the delay in user message sending is clearly
wasted instead of sending directly.
Besides, adding Nagle ACKs will consume some processor load on both the
sending and receiving sides.
This commit adds a test on the effectiveness of the Nagle algorithm for
an individual connection in the network on which it actually runs.
Particularly, upon receipt of a Nagle ACK we will compare the number of
bundles in the backlog queue to the number of user messages which would
be sent directly without Nagle. If the ratio is good (e.g. >= 2), Nagle
mode will be kept for further message sending. Otherwise, we will leave
Nagle and put a 'penalty' on the connection, so it will have to spend
more 'one-way' messages before being able to re-enter Nagle.
In addition, the 'ack-required' bit is only set when really needed that
the number of Nagle ACKs will be reduced during Nagle mode.
Testing with benchmark showed that with the patch, there was not much
difference in throughput for small messages since the tool continuously
sends messages without a break, so Nagle would still take in effect.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 16:38:38 +07:00
|
|
|
tipc_sk_push_backlog(tsk, false);
|
2017-10-13 16:04:20 +07:00
|
|
|
break;
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
case GROUP_PROTOCOL:
|
2017-10-13 16:04:26 +07:00
|
|
|
tipc_group_proto_rcv(grp, &wakeup, hdr, inputq, xmitq);
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
break;
|
2017-10-13 16:04:20 +07:00
|
|
|
case TOP_SRV:
|
2017-10-13 16:04:26 +07:00
|
|
|
tipc_group_member_evt(tsk->group, &wakeup, &sk->sk_rcvbuf,
|
2018-01-09 03:03:26 +07:00
|
|
|
hdr, inputq, xmitq);
|
2017-10-13 16:04:20 +07:00
|
|
|
break;
|
|
|
|
default:
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2017-10-13 16:04:26 +07:00
|
|
|
if (wakeup)
|
|
|
|
sk->sk_write_space(sk);
|
|
|
|
|
2017-10-13 16:04:20 +07:00
|
|
|
kfree_skb(skb);
|
|
|
|
}
|
|
|
|
|
2012-11-30 06:39:14 +07:00
|
|
|
/**
|
2018-09-29 01:23:20 +07:00
|
|
|
* tipc_sk_filter_connect - check incoming message for a connection-based socket
|
2014-03-12 22:31:12 +07:00
|
|
|
* @tsk: TIPC socket
|
2018-09-29 01:23:20 +07:00
|
|
|
* @skb: pointer to message buffer.
|
2020-05-13 19:33:16 +07:00
|
|
|
* @xmitq: for Nagle ACK if any
|
2018-09-29 01:23:20 +07:00
|
|
|
* Returns true if message should be added to receive queue, false otherwise
|
2012-11-30 06:39:14 +07:00
|
|
|
*/
|
2020-05-13 19:33:16 +07:00
|
|
|
static bool tipc_sk_filter_connect(struct tipc_sock *tsk, struct sk_buff *skb,
|
|
|
|
struct sk_buff_head *xmitq)
|
2012-11-30 06:39:14 +07:00
|
|
|
{
|
2014-03-12 22:31:12 +07:00
|
|
|
struct sock *sk = &tsk->sk;
|
2015-01-09 14:27:05 +07:00
|
|
|
struct net *net = sock_net(sk);
|
2015-07-22 21:11:20 +07:00
|
|
|
struct tipc_msg *hdr = buf_msg(skb);
|
2018-09-29 01:23:20 +07:00
|
|
|
bool con_msg = msg_connected(hdr);
|
|
|
|
u32 pport = tsk_peer_port(tsk);
|
|
|
|
u32 pnode = tsk_peer_node(tsk);
|
|
|
|
u32 oport = msg_origport(hdr);
|
|
|
|
u32 onode = msg_orignode(hdr);
|
|
|
|
int err = msg_errcode(hdr);
|
2018-09-29 01:23:22 +07:00
|
|
|
unsigned long delay;
|
2012-11-30 06:39:14 +07:00
|
|
|
|
2015-07-22 21:11:20 +07:00
|
|
|
if (unlikely(msg_mcast(hdr)))
|
|
|
|
return false;
|
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 20:00:41 +07:00
|
|
|
tsk->oneway = 0;
|
2012-11-30 06:39:14 +07:00
|
|
|
|
2016-11-01 20:02:48 +07:00
|
|
|
switch (sk->sk_state) {
|
|
|
|
case TIPC_CONNECTING:
|
2018-09-29 01:23:20 +07:00
|
|
|
/* Setup ACK */
|
|
|
|
if (likely(con_msg)) {
|
|
|
|
if (err)
|
|
|
|
break;
|
|
|
|
tipc_sk_finish_conn(tsk, oport, onode);
|
|
|
|
msg_set_importance(&tsk->phdr, msg_importance(hdr));
|
|
|
|
/* ACK+ message with data is added to receive queue */
|
|
|
|
if (msg_data_sz(hdr))
|
|
|
|
return true;
|
|
|
|
/* Empty ACK-, - wake up sleeping connect() and drop */
|
2019-05-09 12:13:42 +07:00
|
|
|
sk->sk_state_change(sk);
|
2018-09-29 01:23:20 +07:00
|
|
|
msg_set_dest_droppable(hdr, 1);
|
|
|
|
return false;
|
tipc: introduce non-blocking socket connect
TIPC has so far only supported blocking connect(), meaning that a call
to connect() doesn't return until either the connection is fully
established, or an error occurs. This has proved insufficient for many
users, so we now introduce non-blocking connect(), analogous to how
this is done in TCP and other protocols.
With this feature, if a connection cannot be established instantly,
connect() will return the error code "-EINPROGRESS".
If the user later calls connect() again, he will either have the
return code "-EALREADY" or "-EISCONN", depending on whether the
connection has been established or not.
The user must have explicitly set the socket to be non-blocking
(SOCK_NONBLOCK or O_NONBLOCK, depending on method used), so unless
for some reason they had set this already (the socket would anyway
remain blocking in current TIPC) this change should be completely
backwards compatible.
It is also now possible to call select() or poll() to wait for the
completion of a connection.
An effect of the above is that the actual completion of a connection
may now be performed asynchronously, independent of the calls from
user space. Therefore, we now execute this code in BH context, in
the function filter_rcv(), which is executed upon reception of
messages in the socket.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
[PG: minor refactoring for improved connect/disconnect function names]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2012-11-30 06:51:19 +07:00
|
|
|
}
|
2018-09-29 01:23:20 +07:00
|
|
|
/* Ignore connectionless message if not from listening socket */
|
|
|
|
if (oport != pport || onode != pnode)
|
|
|
|
return false;
|
tipc: introduce non-blocking socket connect
TIPC has so far only supported blocking connect(), meaning that a call
to connect() doesn't return until either the connection is fully
established, or an error occurs. This has proved insufficient for many
users, so we now introduce non-blocking connect(), analogous to how
this is done in TCP and other protocols.
With this feature, if a connection cannot be established instantly,
connect() will return the error code "-EINPROGRESS".
If the user later calls connect() again, he will either have the
return code "-EALREADY" or "-EISCONN", depending on whether the
connection has been established or not.
The user must have explicitly set the socket to be non-blocking
(SOCK_NONBLOCK or O_NONBLOCK, depending on method used), so unless
for some reason they had set this already (the socket would anyway
remain blocking in current TIPC) this change should be completely
backwards compatible.
It is also now possible to call select() or poll() to wait for the
completion of a connection.
An effect of the above is that the actual completion of a connection
may now be performed asynchronously, independent of the calls from
user space. Therefore, we now execute this code in BH context, in
the function filter_rcv(), which is executed upon reception of
messages in the socket.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
[PG: minor refactoring for improved connect/disconnect function names]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2012-11-30 06:51:19 +07:00
|
|
|
|
2018-09-29 01:23:22 +07:00
|
|
|
/* Rejected SYN */
|
|
|
|
if (err != TIPC_ERR_OVERLOAD)
|
|
|
|
break;
|
|
|
|
|
|
|
|
/* Prepare for new setup attempt if we have a SYN clone */
|
|
|
|
if (skb_queue_empty(&sk->sk_write_queue))
|
|
|
|
break;
|
|
|
|
get_random_bytes(&delay, 2);
|
|
|
|
delay %= (tsk->conn_timeout / 4);
|
|
|
|
delay = msecs_to_jiffies(delay + 100);
|
|
|
|
sk_reset_timer(sk, &sk->sk_timer, jiffies + delay);
|
|
|
|
return false;
|
2016-11-01 20:02:45 +07:00
|
|
|
case TIPC_OPEN:
|
2016-11-01 20:02:46 +07:00
|
|
|
case TIPC_DISCONNECTING:
|
2018-09-29 01:23:20 +07:00
|
|
|
return false;
|
2016-11-01 20:02:45 +07:00
|
|
|
case TIPC_LISTEN:
|
2012-11-30 06:39:14 +07:00
|
|
|
/* Accept only SYN message */
|
2018-09-29 01:23:21 +07:00
|
|
|
if (!msg_is_syn(hdr) &&
|
|
|
|
tipc_node_get_capabilities(net, onode) & TIPC_SYN_BIT)
|
|
|
|
return false;
|
2018-09-29 01:23:20 +07:00
|
|
|
if (!con_msg && !err)
|
2015-07-22 21:11:20 +07:00
|
|
|
return true;
|
2018-09-29 01:23:20 +07:00
|
|
|
return false;
|
2016-11-01 20:02:49 +07:00
|
|
|
case TIPC_ESTABLISHED:
|
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 20:00:41 +07:00
|
|
|
if (!skb_queue_empty(&sk->sk_write_queue))
|
tipc: add test for Nagle algorithm effectiveness
When streaming in Nagle mode, we try to bundle small messages from user
as many as possible if there is one outstanding buffer, i.e. not ACK-ed
by the receiving side, which helps boost up the overall throughput. So,
the algorithm's effectiveness really depends on when Nagle ACK comes or
what the specific network latency (RTT) is, compared to the user's
message sending rate.
In a bad case, the user's sending rate is low or the network latency is
small, there will not be many bundles, so making a Nagle ACK or waiting
for it is not meaningful.
For example: a user sends its messages every 100ms and the RTT is 50ms,
then for each messages, we require one Nagle ACK but then there is only
one user message sent without any bundles.
In a better case, even if we have a few bundles (e.g. the RTT = 300ms),
but now the user sends messages in medium size, then there will not be
any difference at all, that says 3 x 1000-byte data messages if bundled
will still result in 3 bundles with MTU = 1500.
When Nagle is ineffective, the delay in user message sending is clearly
wasted instead of sending directly.
Besides, adding Nagle ACKs will consume some processor load on both the
sending and receiving sides.
This commit adds a test on the effectiveness of the Nagle algorithm for
an individual connection in the network on which it actually runs.
Particularly, upon receipt of a Nagle ACK we will compare the number of
bundles in the backlog queue to the number of user messages which would
be sent directly without Nagle. If the ratio is good (e.g. >= 2), Nagle
mode will be kept for further message sending. Otherwise, we will leave
Nagle and put a 'penalty' on the connection, so it will have to spend
more 'one-way' messages before being able to re-enter Nagle.
In addition, the 'ack-required' bit is only set when really needed that
the number of Nagle ACKs will be reduced during Nagle mode.
Testing with benchmark showed that with the patch, there was not much
difference in throughput for small messages since the tool continuously
sends messages without a break, so Nagle would still take in effect.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 16:38:38 +07:00
|
|
|
tipc_sk_push_backlog(tsk, false);
|
2016-11-01 20:02:49 +07:00
|
|
|
/* Accept only connection-based messages sent by peer */
|
2020-05-13 19:33:16 +07:00
|
|
|
if (likely(con_msg && !err && pport == oport &&
|
|
|
|
pnode == onode)) {
|
|
|
|
if (msg_ack_required(hdr)) {
|
|
|
|
struct sk_buff *skb;
|
|
|
|
|
|
|
|
skb = tipc_sk_build_ack(tsk);
|
tipc: add test for Nagle algorithm effectiveness
When streaming in Nagle mode, we try to bundle small messages from user
as many as possible if there is one outstanding buffer, i.e. not ACK-ed
by the receiving side, which helps boost up the overall throughput. So,
the algorithm's effectiveness really depends on when Nagle ACK comes or
what the specific network latency (RTT) is, compared to the user's
message sending rate.
In a bad case, the user's sending rate is low or the network latency is
small, there will not be many bundles, so making a Nagle ACK or waiting
for it is not meaningful.
For example: a user sends its messages every 100ms and the RTT is 50ms,
then for each messages, we require one Nagle ACK but then there is only
one user message sent without any bundles.
In a better case, even if we have a few bundles (e.g. the RTT = 300ms),
but now the user sends messages in medium size, then there will not be
any difference at all, that says 3 x 1000-byte data messages if bundled
will still result in 3 bundles with MTU = 1500.
When Nagle is ineffective, the delay in user message sending is clearly
wasted instead of sending directly.
Besides, adding Nagle ACKs will consume some processor load on both the
sending and receiving sides.
This commit adds a test on the effectiveness of the Nagle algorithm for
an individual connection in the network on which it actually runs.
Particularly, upon receipt of a Nagle ACK we will compare the number of
bundles in the backlog queue to the number of user messages which would
be sent directly without Nagle. If the ratio is good (e.g. >= 2), Nagle
mode will be kept for further message sending. Otherwise, we will leave
Nagle and put a 'penalty' on the connection, so it will have to spend
more 'one-way' messages before being able to re-enter Nagle.
In addition, the 'ack-required' bit is only set when really needed that
the number of Nagle ACKs will be reduced during Nagle mode.
Testing with benchmark showed that with the patch, there was not much
difference in throughput for small messages since the tool continuously
sends messages without a break, so Nagle would still take in effect.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 16:38:38 +07:00
|
|
|
if (skb) {
|
|
|
|
msg_set_nagle_ack(buf_msg(skb));
|
2020-05-13 19:33:16 +07:00
|
|
|
__skb_queue_tail(xmitq, skb);
|
tipc: add test for Nagle algorithm effectiveness
When streaming in Nagle mode, we try to bundle small messages from user
as many as possible if there is one outstanding buffer, i.e. not ACK-ed
by the receiving side, which helps boost up the overall throughput. So,
the algorithm's effectiveness really depends on when Nagle ACK comes or
what the specific network latency (RTT) is, compared to the user's
message sending rate.
In a bad case, the user's sending rate is low or the network latency is
small, there will not be many bundles, so making a Nagle ACK or waiting
for it is not meaningful.
For example: a user sends its messages every 100ms and the RTT is 50ms,
then for each messages, we require one Nagle ACK but then there is only
one user message sent without any bundles.
In a better case, even if we have a few bundles (e.g. the RTT = 300ms),
but now the user sends messages in medium size, then there will not be
any difference at all, that says 3 x 1000-byte data messages if bundled
will still result in 3 bundles with MTU = 1500.
When Nagle is ineffective, the delay in user message sending is clearly
wasted instead of sending directly.
Besides, adding Nagle ACKs will consume some processor load on both the
sending and receiving sides.
This commit adds a test on the effectiveness of the Nagle algorithm for
an individual connection in the network on which it actually runs.
Particularly, upon receipt of a Nagle ACK we will compare the number of
bundles in the backlog queue to the number of user messages which would
be sent directly without Nagle. If the ratio is good (e.g. >= 2), Nagle
mode will be kept for further message sending. Otherwise, we will leave
Nagle and put a 'penalty' on the connection, so it will have to spend
more 'one-way' messages before being able to re-enter Nagle.
In addition, the 'ack-required' bit is only set when really needed that
the number of Nagle ACKs will be reduced during Nagle mode.
Testing with benchmark showed that with the patch, there was not much
difference in throughput for small messages since the tool continuously
sends messages without a break, so Nagle would still take in effect.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 16:38:38 +07:00
|
|
|
}
|
2020-05-13 19:33:16 +07:00
|
|
|
}
|
2018-09-29 01:23:20 +07:00
|
|
|
return true;
|
2020-05-13 19:33:16 +07:00
|
|
|
}
|
2018-09-29 01:23:20 +07:00
|
|
|
if (!tsk_peer_msg(tsk, hdr))
|
2016-11-01 20:02:49 +07:00
|
|
|
return false;
|
2018-09-29 01:23:20 +07:00
|
|
|
if (!err)
|
|
|
|
return true;
|
|
|
|
tipc_set_sk_state(sk, TIPC_DISCONNECTING);
|
|
|
|
tipc_node_remove_conn(net, pnode, tsk->portid);
|
|
|
|
sk->sk_state_change(sk);
|
2016-11-01 20:02:49 +07:00
|
|
|
return true;
|
2012-11-30 06:39:14 +07:00
|
|
|
default:
|
2016-11-01 20:02:45 +07:00
|
|
|
pr_err("Unknown sk_state %u\n", sk->sk_state);
|
2012-11-30 06:39:14 +07:00
|
|
|
}
|
2018-09-29 01:23:20 +07:00
|
|
|
/* Abort connection setup attempt */
|
|
|
|
tipc_set_sk_state(sk, TIPC_DISCONNECTING);
|
|
|
|
sk->sk_err = ECONNREFUSED;
|
|
|
|
sk->sk_state_change(sk);
|
|
|
|
return true;
|
2012-11-30 06:39:14 +07:00
|
|
|
}
|
|
|
|
|
2013-01-21 05:30:09 +07:00
|
|
|
/**
|
|
|
|
* rcvbuf_limit - get proper overload limit of socket receive queue
|
|
|
|
* @sk: socket
|
tipc: redesign connection-level flow control
There are two flow control mechanisms in TIPC; one at link level that
handles network congestion, burst control, and retransmission, and one
at connection level which' only remaining task is to prevent overflow
in the receiving socket buffer. In TIPC, the latter task has to be
solved end-to-end because messages can not be thrown away once they
have been accepted and delivered upwards from the link layer, i.e, we
can never permit the receive buffer to overflow.
Currently, this algorithm is message based. A counter in the receiving
socket keeps track of number of consumed messages, and sends a dedicated
acknowledge message back to the sender for each 256 consumed message.
A counter at the sending end keeps track of the sent, not yet
acknowledged messages, and blocks the sender if this number ever reaches
512 unacknowledged messages. When the missing acknowledge arrives, the
socket is then woken up for renewed transmission. This works well for
keeping the message flow running, as it almost never happens that a
sender socket is blocked this way.
A problem with the current mechanism is that it potentially is very
memory consuming. Since we don't distinguish between small and large
messages, we have to dimension the socket receive buffer according
to a worst-case of both. I.e., the window size must be chosen large
enough to sustain a reasonable throughput even for the smallest
messages, while we must still consider a scenario where all messages
are of maximum size. Hence, the current fix window size of 512 messages
and a maximum message size of 66k results in a receive buffer of 66 MB
when truesize(66k) = 131k is taken into account. It is possible to do
much better.
This commit introduces an algorithm where we instead use 1024-byte
blocks as base unit. This unit, always rounded upwards from the
actual message size, is used when we advertise windows as well as when
we count and acknowledge transmitted data. The advertised window is
based on the configured receive buffer size in such a way that even
the worst-case truesize/msgsize ratio always is covered. Since the
smallest possible message size (from a flow control viewpoint) now is
1024 bytes, we can safely assume this ratio to be less than four, which
is the value we are now using.
This way, we have been able to reduce the default receive buffer size
from 66 MB to 2 MB with maintained performance.
In order to keep this solution backwards compatible, we introduce a
new capability bit in the discovery protocol, and use this throughout
the message sending/reception path to always select the right unit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 22:58:47 +07:00
|
|
|
* @skb: message
|
2013-01-21 05:30:09 +07:00
|
|
|
*
|
tipc: redesign connection-level flow control
There are two flow control mechanisms in TIPC; one at link level that
handles network congestion, burst control, and retransmission, and one
at connection level which' only remaining task is to prevent overflow
in the receiving socket buffer. In TIPC, the latter task has to be
solved end-to-end because messages can not be thrown away once they
have been accepted and delivered upwards from the link layer, i.e, we
can never permit the receive buffer to overflow.
Currently, this algorithm is message based. A counter in the receiving
socket keeps track of number of consumed messages, and sends a dedicated
acknowledge message back to the sender for each 256 consumed message.
A counter at the sending end keeps track of the sent, not yet
acknowledged messages, and blocks the sender if this number ever reaches
512 unacknowledged messages. When the missing acknowledge arrives, the
socket is then woken up for renewed transmission. This works well for
keeping the message flow running, as it almost never happens that a
sender socket is blocked this way.
A problem with the current mechanism is that it potentially is very
memory consuming. Since we don't distinguish between small and large
messages, we have to dimension the socket receive buffer according
to a worst-case of both. I.e., the window size must be chosen large
enough to sustain a reasonable throughput even for the smallest
messages, while we must still consider a scenario where all messages
are of maximum size. Hence, the current fix window size of 512 messages
and a maximum message size of 66k results in a receive buffer of 66 MB
when truesize(66k) = 131k is taken into account. It is possible to do
much better.
This commit introduces an algorithm where we instead use 1024-byte
blocks as base unit. This unit, always rounded upwards from the
actual message size, is used when we advertise windows as well as when
we count and acknowledge transmitted data. The advertised window is
based on the configured receive buffer size in such a way that even
the worst-case truesize/msgsize ratio always is covered. Since the
smallest possible message size (from a flow control viewpoint) now is
1024 bytes, we can safely assume this ratio to be less than four, which
is the value we are now using.
This way, we have been able to reduce the default receive buffer size
from 66 MB to 2 MB with maintained performance.
In order to keep this solution backwards compatible, we introduce a
new capability bit in the discovery protocol, and use this throughout
the message sending/reception path to always select the right unit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 22:58:47 +07:00
|
|
|
* For connection oriented messages, irrespective of importance,
|
|
|
|
* default queue limit is 2 MB.
|
2013-01-21 05:30:09 +07:00
|
|
|
*
|
tipc: redesign connection-level flow control
There are two flow control mechanisms in TIPC; one at link level that
handles network congestion, burst control, and retransmission, and one
at connection level which' only remaining task is to prevent overflow
in the receiving socket buffer. In TIPC, the latter task has to be
solved end-to-end because messages can not be thrown away once they
have been accepted and delivered upwards from the link layer, i.e, we
can never permit the receive buffer to overflow.
Currently, this algorithm is message based. A counter in the receiving
socket keeps track of number of consumed messages, and sends a dedicated
acknowledge message back to the sender for each 256 consumed message.
A counter at the sending end keeps track of the sent, not yet
acknowledged messages, and blocks the sender if this number ever reaches
512 unacknowledged messages. When the missing acknowledge arrives, the
socket is then woken up for renewed transmission. This works well for
keeping the message flow running, as it almost never happens that a
sender socket is blocked this way.
A problem with the current mechanism is that it potentially is very
memory consuming. Since we don't distinguish between small and large
messages, we have to dimension the socket receive buffer according
to a worst-case of both. I.e., the window size must be chosen large
enough to sustain a reasonable throughput even for the smallest
messages, while we must still consider a scenario where all messages
are of maximum size. Hence, the current fix window size of 512 messages
and a maximum message size of 66k results in a receive buffer of 66 MB
when truesize(66k) = 131k is taken into account. It is possible to do
much better.
This commit introduces an algorithm where we instead use 1024-byte
blocks as base unit. This unit, always rounded upwards from the
actual message size, is used when we advertise windows as well as when
we count and acknowledge transmitted data. The advertised window is
based on the configured receive buffer size in such a way that even
the worst-case truesize/msgsize ratio always is covered. Since the
smallest possible message size (from a flow control viewpoint) now is
1024 bytes, we can safely assume this ratio to be less than four, which
is the value we are now using.
This way, we have been able to reduce the default receive buffer size
from 66 MB to 2 MB with maintained performance.
In order to keep this solution backwards compatible, we introduce a
new capability bit in the discovery protocol, and use this throughout
the message sending/reception path to always select the right unit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 22:58:47 +07:00
|
|
|
* For connectionless messages, queue limits are based on message
|
|
|
|
* importance as follows:
|
2013-01-21 05:30:09 +07:00
|
|
|
*
|
tipc: redesign connection-level flow control
There are two flow control mechanisms in TIPC; one at link level that
handles network congestion, burst control, and retransmission, and one
at connection level which' only remaining task is to prevent overflow
in the receiving socket buffer. In TIPC, the latter task has to be
solved end-to-end because messages can not be thrown away once they
have been accepted and delivered upwards from the link layer, i.e, we
can never permit the receive buffer to overflow.
Currently, this algorithm is message based. A counter in the receiving
socket keeps track of number of consumed messages, and sends a dedicated
acknowledge message back to the sender for each 256 consumed message.
A counter at the sending end keeps track of the sent, not yet
acknowledged messages, and blocks the sender if this number ever reaches
512 unacknowledged messages. When the missing acknowledge arrives, the
socket is then woken up for renewed transmission. This works well for
keeping the message flow running, as it almost never happens that a
sender socket is blocked this way.
A problem with the current mechanism is that it potentially is very
memory consuming. Since we don't distinguish between small and large
messages, we have to dimension the socket receive buffer according
to a worst-case of both. I.e., the window size must be chosen large
enough to sustain a reasonable throughput even for the smallest
messages, while we must still consider a scenario where all messages
are of maximum size. Hence, the current fix window size of 512 messages
and a maximum message size of 66k results in a receive buffer of 66 MB
when truesize(66k) = 131k is taken into account. It is possible to do
much better.
This commit introduces an algorithm where we instead use 1024-byte
blocks as base unit. This unit, always rounded upwards from the
actual message size, is used when we advertise windows as well as when
we count and acknowledge transmitted data. The advertised window is
based on the configured receive buffer size in such a way that even
the worst-case truesize/msgsize ratio always is covered. Since the
smallest possible message size (from a flow control viewpoint) now is
1024 bytes, we can safely assume this ratio to be less than four, which
is the value we are now using.
This way, we have been able to reduce the default receive buffer size
from 66 MB to 2 MB with maintained performance.
In order to keep this solution backwards compatible, we introduce a
new capability bit in the discovery protocol, and use this throughout
the message sending/reception path to always select the right unit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 22:58:47 +07:00
|
|
|
* TIPC_LOW_IMPORTANCE (2 MB)
|
|
|
|
* TIPC_MEDIUM_IMPORTANCE (4 MB)
|
|
|
|
* TIPC_HIGH_IMPORTANCE (8 MB)
|
|
|
|
* TIPC_CRITICAL_IMPORTANCE (16 MB)
|
2013-01-21 05:30:09 +07:00
|
|
|
*
|
|
|
|
* Returns overload limit according to corresponding message importance
|
|
|
|
*/
|
tipc: redesign connection-level flow control
There are two flow control mechanisms in TIPC; one at link level that
handles network congestion, burst control, and retransmission, and one
at connection level which' only remaining task is to prevent overflow
in the receiving socket buffer. In TIPC, the latter task has to be
solved end-to-end because messages can not be thrown away once they
have been accepted and delivered upwards from the link layer, i.e, we
can never permit the receive buffer to overflow.
Currently, this algorithm is message based. A counter in the receiving
socket keeps track of number of consumed messages, and sends a dedicated
acknowledge message back to the sender for each 256 consumed message.
A counter at the sending end keeps track of the sent, not yet
acknowledged messages, and blocks the sender if this number ever reaches
512 unacknowledged messages. When the missing acknowledge arrives, the
socket is then woken up for renewed transmission. This works well for
keeping the message flow running, as it almost never happens that a
sender socket is blocked this way.
A problem with the current mechanism is that it potentially is very
memory consuming. Since we don't distinguish between small and large
messages, we have to dimension the socket receive buffer according
to a worst-case of both. I.e., the window size must be chosen large
enough to sustain a reasonable throughput even for the smallest
messages, while we must still consider a scenario where all messages
are of maximum size. Hence, the current fix window size of 512 messages
and a maximum message size of 66k results in a receive buffer of 66 MB
when truesize(66k) = 131k is taken into account. It is possible to do
much better.
This commit introduces an algorithm where we instead use 1024-byte
blocks as base unit. This unit, always rounded upwards from the
actual message size, is used when we advertise windows as well as when
we count and acknowledge transmitted data. The advertised window is
based on the configured receive buffer size in such a way that even
the worst-case truesize/msgsize ratio always is covered. Since the
smallest possible message size (from a flow control viewpoint) now is
1024 bytes, we can safely assume this ratio to be less than four, which
is the value we are now using.
This way, we have been able to reduce the default receive buffer size
from 66 MB to 2 MB with maintained performance.
In order to keep this solution backwards compatible, we introduce a
new capability bit in the discovery protocol, and use this throughout
the message sending/reception path to always select the right unit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 22:58:47 +07:00
|
|
|
static unsigned int rcvbuf_limit(struct sock *sk, struct sk_buff *skb)
|
2013-01-21 05:30:09 +07:00
|
|
|
{
|
tipc: redesign connection-level flow control
There are two flow control mechanisms in TIPC; one at link level that
handles network congestion, burst control, and retransmission, and one
at connection level which' only remaining task is to prevent overflow
in the receiving socket buffer. In TIPC, the latter task has to be
solved end-to-end because messages can not be thrown away once they
have been accepted and delivered upwards from the link layer, i.e, we
can never permit the receive buffer to overflow.
Currently, this algorithm is message based. A counter in the receiving
socket keeps track of number of consumed messages, and sends a dedicated
acknowledge message back to the sender for each 256 consumed message.
A counter at the sending end keeps track of the sent, not yet
acknowledged messages, and blocks the sender if this number ever reaches
512 unacknowledged messages. When the missing acknowledge arrives, the
socket is then woken up for renewed transmission. This works well for
keeping the message flow running, as it almost never happens that a
sender socket is blocked this way.
A problem with the current mechanism is that it potentially is very
memory consuming. Since we don't distinguish between small and large
messages, we have to dimension the socket receive buffer according
to a worst-case of both. I.e., the window size must be chosen large
enough to sustain a reasonable throughput even for the smallest
messages, while we must still consider a scenario where all messages
are of maximum size. Hence, the current fix window size of 512 messages
and a maximum message size of 66k results in a receive buffer of 66 MB
when truesize(66k) = 131k is taken into account. It is possible to do
much better.
This commit introduces an algorithm where we instead use 1024-byte
blocks as base unit. This unit, always rounded upwards from the
actual message size, is used when we advertise windows as well as when
we count and acknowledge transmitted data. The advertised window is
based on the configured receive buffer size in such a way that even
the worst-case truesize/msgsize ratio always is covered. Since the
smallest possible message size (from a flow control viewpoint) now is
1024 bytes, we can safely assume this ratio to be less than four, which
is the value we are now using.
This way, we have been able to reduce the default receive buffer size
from 66 MB to 2 MB with maintained performance.
In order to keep this solution backwards compatible, we introduce a
new capability bit in the discovery protocol, and use this throughout
the message sending/reception path to always select the right unit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 22:58:47 +07:00
|
|
|
struct tipc_sock *tsk = tipc_sk(sk);
|
|
|
|
struct tipc_msg *hdr = buf_msg(skb);
|
|
|
|
|
2017-10-13 16:04:26 +07:00
|
|
|
if (unlikely(msg_in_group(hdr)))
|
2019-10-10 05:21:13 +07:00
|
|
|
return READ_ONCE(sk->sk_rcvbuf);
|
2017-10-13 16:04:26 +07:00
|
|
|
|
tipc: redesign connection-level flow control
There are two flow control mechanisms in TIPC; one at link level that
handles network congestion, burst control, and retransmission, and one
at connection level which' only remaining task is to prevent overflow
in the receiving socket buffer. In TIPC, the latter task has to be
solved end-to-end because messages can not be thrown away once they
have been accepted and delivered upwards from the link layer, i.e, we
can never permit the receive buffer to overflow.
Currently, this algorithm is message based. A counter in the receiving
socket keeps track of number of consumed messages, and sends a dedicated
acknowledge message back to the sender for each 256 consumed message.
A counter at the sending end keeps track of the sent, not yet
acknowledged messages, and blocks the sender if this number ever reaches
512 unacknowledged messages. When the missing acknowledge arrives, the
socket is then woken up for renewed transmission. This works well for
keeping the message flow running, as it almost never happens that a
sender socket is blocked this way.
A problem with the current mechanism is that it potentially is very
memory consuming. Since we don't distinguish between small and large
messages, we have to dimension the socket receive buffer according
to a worst-case of both. I.e., the window size must be chosen large
enough to sustain a reasonable throughput even for the smallest
messages, while we must still consider a scenario where all messages
are of maximum size. Hence, the current fix window size of 512 messages
and a maximum message size of 66k results in a receive buffer of 66 MB
when truesize(66k) = 131k is taken into account. It is possible to do
much better.
This commit introduces an algorithm where we instead use 1024-byte
blocks as base unit. This unit, always rounded upwards from the
actual message size, is used when we advertise windows as well as when
we count and acknowledge transmitted data. The advertised window is
based on the configured receive buffer size in such a way that even
the worst-case truesize/msgsize ratio always is covered. Since the
smallest possible message size (from a flow control viewpoint) now is
1024 bytes, we can safely assume this ratio to be less than four, which
is the value we are now using.
This way, we have been able to reduce the default receive buffer size
from 66 MB to 2 MB with maintained performance.
In order to keep this solution backwards compatible, we introduce a
new capability bit in the discovery protocol, and use this throughout
the message sending/reception path to always select the right unit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 22:58:47 +07:00
|
|
|
if (unlikely(!msg_connected(hdr)))
|
2019-10-10 05:21:13 +07:00
|
|
|
return READ_ONCE(sk->sk_rcvbuf) << msg_importance(hdr);
|
2013-01-21 05:30:09 +07:00
|
|
|
|
tipc: redesign connection-level flow control
There are two flow control mechanisms in TIPC; one at link level that
handles network congestion, burst control, and retransmission, and one
at connection level which' only remaining task is to prevent overflow
in the receiving socket buffer. In TIPC, the latter task has to be
solved end-to-end because messages can not be thrown away once they
have been accepted and delivered upwards from the link layer, i.e, we
can never permit the receive buffer to overflow.
Currently, this algorithm is message based. A counter in the receiving
socket keeps track of number of consumed messages, and sends a dedicated
acknowledge message back to the sender for each 256 consumed message.
A counter at the sending end keeps track of the sent, not yet
acknowledged messages, and blocks the sender if this number ever reaches
512 unacknowledged messages. When the missing acknowledge arrives, the
socket is then woken up for renewed transmission. This works well for
keeping the message flow running, as it almost never happens that a
sender socket is blocked this way.
A problem with the current mechanism is that it potentially is very
memory consuming. Since we don't distinguish between small and large
messages, we have to dimension the socket receive buffer according
to a worst-case of both. I.e., the window size must be chosen large
enough to sustain a reasonable throughput even for the smallest
messages, while we must still consider a scenario where all messages
are of maximum size. Hence, the current fix window size of 512 messages
and a maximum message size of 66k results in a receive buffer of 66 MB
when truesize(66k) = 131k is taken into account. It is possible to do
much better.
This commit introduces an algorithm where we instead use 1024-byte
blocks as base unit. This unit, always rounded upwards from the
actual message size, is used when we advertise windows as well as when
we count and acknowledge transmitted data. The advertised window is
based on the configured receive buffer size in such a way that even
the worst-case truesize/msgsize ratio always is covered. Since the
smallest possible message size (from a flow control viewpoint) now is
1024 bytes, we can safely assume this ratio to be less than four, which
is the value we are now using.
This way, we have been able to reduce the default receive buffer size
from 66 MB to 2 MB with maintained performance.
In order to keep this solution backwards compatible, we introduce a
new capability bit in the discovery protocol, and use this throughout
the message sending/reception path to always select the right unit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 22:58:47 +07:00
|
|
|
if (likely(tsk->peer_caps & TIPC_BLOCK_FLOWCTL))
|
2019-10-10 05:21:13 +07:00
|
|
|
return READ_ONCE(sk->sk_rcvbuf);
|
2013-12-12 08:36:39 +07:00
|
|
|
|
tipc: redesign connection-level flow control
There are two flow control mechanisms in TIPC; one at link level that
handles network congestion, burst control, and retransmission, and one
at connection level which' only remaining task is to prevent overflow
in the receiving socket buffer. In TIPC, the latter task has to be
solved end-to-end because messages can not be thrown away once they
have been accepted and delivered upwards from the link layer, i.e, we
can never permit the receive buffer to overflow.
Currently, this algorithm is message based. A counter in the receiving
socket keeps track of number of consumed messages, and sends a dedicated
acknowledge message back to the sender for each 256 consumed message.
A counter at the sending end keeps track of the sent, not yet
acknowledged messages, and blocks the sender if this number ever reaches
512 unacknowledged messages. When the missing acknowledge arrives, the
socket is then woken up for renewed transmission. This works well for
keeping the message flow running, as it almost never happens that a
sender socket is blocked this way.
A problem with the current mechanism is that it potentially is very
memory consuming. Since we don't distinguish between small and large
messages, we have to dimension the socket receive buffer according
to a worst-case of both. I.e., the window size must be chosen large
enough to sustain a reasonable throughput even for the smallest
messages, while we must still consider a scenario where all messages
are of maximum size. Hence, the current fix window size of 512 messages
and a maximum message size of 66k results in a receive buffer of 66 MB
when truesize(66k) = 131k is taken into account. It is possible to do
much better.
This commit introduces an algorithm where we instead use 1024-byte
blocks as base unit. This unit, always rounded upwards from the
actual message size, is used when we advertise windows as well as when
we count and acknowledge transmitted data. The advertised window is
based on the configured receive buffer size in such a way that even
the worst-case truesize/msgsize ratio always is covered. Since the
smallest possible message size (from a flow control viewpoint) now is
1024 bytes, we can safely assume this ratio to be less than four, which
is the value we are now using.
This way, we have been able to reduce the default receive buffer size
from 66 MB to 2 MB with maintained performance.
In order to keep this solution backwards compatible, we introduce a
new capability bit in the discovery protocol, and use this throughout
the message sending/reception path to always select the right unit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 22:58:47 +07:00
|
|
|
return FLOWCTL_MSG_LIM;
|
2013-01-21 05:30:09 +07:00
|
|
|
}
|
|
|
|
|
2007-02-09 21:25:21 +07:00
|
|
|
/**
|
2017-10-13 16:04:20 +07:00
|
|
|
* tipc_sk_filter_rcv - validate incoming message
|
2008-04-15 14:22:02 +07:00
|
|
|
* @sk: socket
|
2015-07-22 21:11:20 +07:00
|
|
|
* @skb: pointer to message.
|
2007-02-09 21:25:21 +07:00
|
|
|
*
|
2008-04-15 14:22:02 +07:00
|
|
|
* Enqueues message on receive queue if acceptable; optionally handles
|
|
|
|
* disconnect indication for a connected socket.
|
|
|
|
*
|
2015-02-05 20:36:37 +07:00
|
|
|
* Called with socket lock already taken
|
2007-02-09 21:25:21 +07:00
|
|
|
*
|
2006-01-03 01:04:38 +07:00
|
|
|
*/
|
2017-10-13 16:04:20 +07:00
|
|
|
static void tipc_sk_filter_rcv(struct sock *sk, struct sk_buff *skb,
|
|
|
|
struct sk_buff_head *xmitq)
|
2006-01-03 01:04:38 +07:00
|
|
|
{
|
2017-10-13 16:04:20 +07:00
|
|
|
bool sk_conn = !tipc_sk_type_connectionless(sk);
|
2014-03-12 22:31:12 +07:00
|
|
|
struct tipc_sock *tsk = tipc_sk(sk);
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
struct tipc_group *grp = tsk->group;
|
2015-07-22 21:11:20 +07:00
|
|
|
struct tipc_msg *hdr = buf_msg(skb);
|
2017-10-13 16:04:20 +07:00
|
|
|
struct net *net = sock_net(sk);
|
|
|
|
struct sk_buff_head inputq;
|
2019-03-21 17:25:17 +07:00
|
|
|
int mtyp = msg_type(hdr);
|
2017-10-13 16:04:20 +07:00
|
|
|
int limit, err = TIPC_OK;
|
2014-06-26 08:41:40 +07:00
|
|
|
|
tipc: add trace_events for tipc socket
The commit adds the new trace_events for TIPC socket object:
trace_tipc_sk_create()
trace_tipc_sk_poll()
trace_tipc_sk_sendmsg()
trace_tipc_sk_sendmcast()
trace_tipc_sk_sendstream()
trace_tipc_sk_filter_rcv()
trace_tipc_sk_advance_rx()
trace_tipc_sk_rej_msg()
trace_tipc_sk_drop_msg()
trace_tipc_sk_release()
trace_tipc_sk_shutdown()
trace_tipc_sk_overlimit1()
trace_tipc_sk_overlimit2()
Also, enables the traces for the following cases:
- When user creates a TIPC socket;
- When user calls poll() on TIPC socket;
- When user sends a dgram/mcast/stream message.
- When a message is put into the socket 'sk_receive_queue';
- When a message is released from the socket 'sk_receive_queue';
- When a message is rejected (e.g. due to no port, invalid, etc.);
- When a message is dropped (e.g. due to wrong message type);
- When socket is released;
- When socket is shutdown;
- When socket rcvq's allocation is overlimit (> 90%);
- When socket rcvq + bklq's allocation is overlimit (> 90%);
- When the 'TIPC_ERR_OVERLOAD/2' issue happens;
Note:
a) All the socket traces are designed to be able to trace on a specific
socket by either using the 'event filtering' feature on a known socket
'portid' value or the sysctl file:
/proc/sys/net/tipc/sk_filter
The file determines a 'tuple' for what socket should be traced:
(portid, sock type, name type, name lower, name upper)
where:
+ 'portid' is the socket portid generated at socket creating, can be
found in the trace outputs or the 'tipc socket list' command printouts;
+ 'sock type' is the socket type (1 = SOCK_TREAM, ...);
+ 'name type', 'name lower' and 'name upper' are the service name being
connected to or published by the socket.
Value '0' means 'ANY', the default tuple value is (0, 0, 0, 0, 0) i.e.
the traces happen for every sockets with no filter.
b) The 'tipc_sk_overlimit1/2' event is also a conditional trace_event
which happens when the socket receive queue (and backlog queue) is
about to be overloaded, when the queue allocation is > 90%. Then, when
the trace is enabled, the last skbs leading to the TIPC_ERR_OVERLOAD/2
issue can be traced.
The trace event is designed as an 'upper watermark' notification that
the other traces (e.g. 'tipc_sk_advance_rx' vs 'tipc_sk_filter_rcv') or
actions can be triggerred in the meanwhile to see what is going on with
the socket queue.
In addition, the 'trace_tipc_sk_dump()' is also placed at the
'TIPC_ERR_OVERLOAD/2' case, so the socket and last skb can be dumped
for post-analysis.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 09:17:58 +07:00
|
|
|
trace_tipc_sk_filter_rcv(sk, skb, TIPC_DUMP_ALL, " ");
|
2017-10-13 16:04:20 +07:00
|
|
|
TIPC_SKB_CB(skb)->bytes_read = 0;
|
|
|
|
__skb_queue_head_init(&inputq);
|
|
|
|
__skb_queue_tail(&inputq, skb);
|
2014-08-23 05:09:07 +07:00
|
|
|
|
2017-10-13 16:04:20 +07:00
|
|
|
if (unlikely(!msg_isdata(hdr)))
|
|
|
|
tipc_sk_proto_rcv(sk, &inputq, xmitq);
|
2008-04-15 14:22:02 +07:00
|
|
|
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
if (unlikely(grp))
|
|
|
|
tipc_group_filter_msg(grp, &inputq, xmitq);
|
|
|
|
|
2019-03-21 17:25:17 +07:00
|
|
|
if (unlikely(!grp) && mtyp == TIPC_MCAST_MSG)
|
2019-03-21 17:25:18 +07:00
|
|
|
tipc_mcast_filter_msg(net, &tsk->mc_method.deferredq, &inputq);
|
tipc: smooth change between replicast and broadcast
Currently, a multicast stream may start out using replicast, because
there are few destinations, and then it should ideally switch to
L2/broadcast IGMP/multicast when the number of destinations grows beyond
a certain limit. The opposite should happen when the number decreases
below the limit.
To eliminate the risk of message reordering caused by method change,
a sending socket must stick to a previously selected method until it
enters an idle period of 5 seconds. Means there is a 5 seconds pause
in the traffic from the sender socket.
If the sender never makes such a pause, the method will never change,
and transmission may become very inefficient as the cluster grows.
With this commit, we allow such a switch between replicast and
broadcast without any need for a traffic pause.
Solution is to send a dummy message with only the header, also with
the SYN bit set, via broadcast or replicast. For the data message,
the SYN bit is set and sending via replicast or broadcast (inverse
method with dummy).
Then, at receiving side any messages follow first SYN bit message
(data or dummy message), they will be held in deferred queue until
another pair (dummy or data message) arrived in other link.
v2: reverse christmas tree declaration
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-19 18:49:50 +07:00
|
|
|
|
2017-10-13 16:04:20 +07:00
|
|
|
/* Validate and add to receive buffer if there is space */
|
|
|
|
while ((skb = __skb_dequeue(&inputq))) {
|
|
|
|
hdr = buf_msg(skb);
|
|
|
|
limit = rcvbuf_limit(sk, skb);
|
2020-05-13 19:33:16 +07:00
|
|
|
if ((sk_conn && !tipc_sk_filter_connect(tsk, skb, xmitq)) ||
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
(!sk_conn && msg_connected(hdr)) ||
|
|
|
|
(!grp && msg_in_group(hdr)))
|
2015-07-22 21:11:20 +07:00
|
|
|
err = TIPC_ERR_NO_PORT;
|
2018-03-21 20:37:45 +07:00
|
|
|
else if (sk_rmem_alloc_get(sk) + skb->truesize >= limit) {
|
tipc: add trace_events for tipc socket
The commit adds the new trace_events for TIPC socket object:
trace_tipc_sk_create()
trace_tipc_sk_poll()
trace_tipc_sk_sendmsg()
trace_tipc_sk_sendmcast()
trace_tipc_sk_sendstream()
trace_tipc_sk_filter_rcv()
trace_tipc_sk_advance_rx()
trace_tipc_sk_rej_msg()
trace_tipc_sk_drop_msg()
trace_tipc_sk_release()
trace_tipc_sk_shutdown()
trace_tipc_sk_overlimit1()
trace_tipc_sk_overlimit2()
Also, enables the traces for the following cases:
- When user creates a TIPC socket;
- When user calls poll() on TIPC socket;
- When user sends a dgram/mcast/stream message.
- When a message is put into the socket 'sk_receive_queue';
- When a message is released from the socket 'sk_receive_queue';
- When a message is rejected (e.g. due to no port, invalid, etc.);
- When a message is dropped (e.g. due to wrong message type);
- When socket is released;
- When socket is shutdown;
- When socket rcvq's allocation is overlimit (> 90%);
- When socket rcvq + bklq's allocation is overlimit (> 90%);
- When the 'TIPC_ERR_OVERLOAD/2' issue happens;
Note:
a) All the socket traces are designed to be able to trace on a specific
socket by either using the 'event filtering' feature on a known socket
'portid' value or the sysctl file:
/proc/sys/net/tipc/sk_filter
The file determines a 'tuple' for what socket should be traced:
(portid, sock type, name type, name lower, name upper)
where:
+ 'portid' is the socket portid generated at socket creating, can be
found in the trace outputs or the 'tipc socket list' command printouts;
+ 'sock type' is the socket type (1 = SOCK_TREAM, ...);
+ 'name type', 'name lower' and 'name upper' are the service name being
connected to or published by the socket.
Value '0' means 'ANY', the default tuple value is (0, 0, 0, 0, 0) i.e.
the traces happen for every sockets with no filter.
b) The 'tipc_sk_overlimit1/2' event is also a conditional trace_event
which happens when the socket receive queue (and backlog queue) is
about to be overloaded, when the queue allocation is > 90%. Then, when
the trace is enabled, the last skbs leading to the TIPC_ERR_OVERLOAD/2
issue can be traced.
The trace event is designed as an 'upper watermark' notification that
the other traces (e.g. 'tipc_sk_advance_rx' vs 'tipc_sk_filter_rcv') or
actions can be triggerred in the meanwhile to see what is going on with
the socket queue.
In addition, the 'trace_tipc_sk_dump()' is also placed at the
'TIPC_ERR_OVERLOAD/2' case, so the socket and last skb can be dumped
for post-analysis.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 09:17:58 +07:00
|
|
|
trace_tipc_sk_dump(sk, skb, TIPC_DUMP_ALL,
|
|
|
|
"err_overload2!");
|
2018-03-21 20:37:45 +07:00
|
|
|
atomic_inc(&sk->sk_drops);
|
2017-10-13 16:04:20 +07:00
|
|
|
err = TIPC_ERR_OVERLOAD;
|
2018-03-21 20:37:45 +07:00
|
|
|
}
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2017-10-13 16:04:20 +07:00
|
|
|
if (unlikely(err)) {
|
tipc: add trace_events for tipc socket
The commit adds the new trace_events for TIPC socket object:
trace_tipc_sk_create()
trace_tipc_sk_poll()
trace_tipc_sk_sendmsg()
trace_tipc_sk_sendmcast()
trace_tipc_sk_sendstream()
trace_tipc_sk_filter_rcv()
trace_tipc_sk_advance_rx()
trace_tipc_sk_rej_msg()
trace_tipc_sk_drop_msg()
trace_tipc_sk_release()
trace_tipc_sk_shutdown()
trace_tipc_sk_overlimit1()
trace_tipc_sk_overlimit2()
Also, enables the traces for the following cases:
- When user creates a TIPC socket;
- When user calls poll() on TIPC socket;
- When user sends a dgram/mcast/stream message.
- When a message is put into the socket 'sk_receive_queue';
- When a message is released from the socket 'sk_receive_queue';
- When a message is rejected (e.g. due to no port, invalid, etc.);
- When a message is dropped (e.g. due to wrong message type);
- When socket is released;
- When socket is shutdown;
- When socket rcvq's allocation is overlimit (> 90%);
- When socket rcvq + bklq's allocation is overlimit (> 90%);
- When the 'TIPC_ERR_OVERLOAD/2' issue happens;
Note:
a) All the socket traces are designed to be able to trace on a specific
socket by either using the 'event filtering' feature on a known socket
'portid' value or the sysctl file:
/proc/sys/net/tipc/sk_filter
The file determines a 'tuple' for what socket should be traced:
(portid, sock type, name type, name lower, name upper)
where:
+ 'portid' is the socket portid generated at socket creating, can be
found in the trace outputs or the 'tipc socket list' command printouts;
+ 'sock type' is the socket type (1 = SOCK_TREAM, ...);
+ 'name type', 'name lower' and 'name upper' are the service name being
connected to or published by the socket.
Value '0' means 'ANY', the default tuple value is (0, 0, 0, 0, 0) i.e.
the traces happen for every sockets with no filter.
b) The 'tipc_sk_overlimit1/2' event is also a conditional trace_event
which happens when the socket receive queue (and backlog queue) is
about to be overloaded, when the queue allocation is > 90%. Then, when
the trace is enabled, the last skbs leading to the TIPC_ERR_OVERLOAD/2
issue can be traced.
The trace event is designed as an 'upper watermark' notification that
the other traces (e.g. 'tipc_sk_advance_rx' vs 'tipc_sk_filter_rcv') or
actions can be triggerred in the meanwhile to see what is going on with
the socket queue.
In addition, the 'trace_tipc_sk_dump()' is also placed at the
'TIPC_ERR_OVERLOAD/2' case, so the socket and last skb can be dumped
for post-analysis.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 09:17:58 +07:00
|
|
|
if (tipc_msg_reverse(tipc_own_addr(net), &skb, err)) {
|
|
|
|
trace_tipc_sk_rej_msg(sk, skb, TIPC_DUMP_NONE,
|
|
|
|
"@filter_rcv!");
|
|
|
|
__skb_queue_tail(xmitq, skb);
|
|
|
|
}
|
2017-10-13 16:04:20 +07:00
|
|
|
err = TIPC_OK;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
__skb_queue_tail(&sk->sk_receive_queue, skb);
|
|
|
|
skb_set_owner_r(skb, sk);
|
tipc: add trace_events for tipc socket
The commit adds the new trace_events for TIPC socket object:
trace_tipc_sk_create()
trace_tipc_sk_poll()
trace_tipc_sk_sendmsg()
trace_tipc_sk_sendmcast()
trace_tipc_sk_sendstream()
trace_tipc_sk_filter_rcv()
trace_tipc_sk_advance_rx()
trace_tipc_sk_rej_msg()
trace_tipc_sk_drop_msg()
trace_tipc_sk_release()
trace_tipc_sk_shutdown()
trace_tipc_sk_overlimit1()
trace_tipc_sk_overlimit2()
Also, enables the traces for the following cases:
- When user creates a TIPC socket;
- When user calls poll() on TIPC socket;
- When user sends a dgram/mcast/stream message.
- When a message is put into the socket 'sk_receive_queue';
- When a message is released from the socket 'sk_receive_queue';
- When a message is rejected (e.g. due to no port, invalid, etc.);
- When a message is dropped (e.g. due to wrong message type);
- When socket is released;
- When socket is shutdown;
- When socket rcvq's allocation is overlimit (> 90%);
- When socket rcvq + bklq's allocation is overlimit (> 90%);
- When the 'TIPC_ERR_OVERLOAD/2' issue happens;
Note:
a) All the socket traces are designed to be able to trace on a specific
socket by either using the 'event filtering' feature on a known socket
'portid' value or the sysctl file:
/proc/sys/net/tipc/sk_filter
The file determines a 'tuple' for what socket should be traced:
(portid, sock type, name type, name lower, name upper)
where:
+ 'portid' is the socket portid generated at socket creating, can be
found in the trace outputs or the 'tipc socket list' command printouts;
+ 'sock type' is the socket type (1 = SOCK_TREAM, ...);
+ 'name type', 'name lower' and 'name upper' are the service name being
connected to or published by the socket.
Value '0' means 'ANY', the default tuple value is (0, 0, 0, 0, 0) i.e.
the traces happen for every sockets with no filter.
b) The 'tipc_sk_overlimit1/2' event is also a conditional trace_event
which happens when the socket receive queue (and backlog queue) is
about to be overloaded, when the queue allocation is > 90%. Then, when
the trace is enabled, the last skbs leading to the TIPC_ERR_OVERLOAD/2
issue can be traced.
The trace event is designed as an 'upper watermark' notification that
the other traces (e.g. 'tipc_sk_advance_rx' vs 'tipc_sk_filter_rcv') or
actions can be triggerred in the meanwhile to see what is going on with
the socket queue.
In addition, the 'trace_tipc_sk_dump()' is also placed at the
'TIPC_ERR_OVERLOAD/2' case, so the socket and last skb can be dumped
for post-analysis.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 09:17:58 +07:00
|
|
|
trace_tipc_sk_overlimit2(sk, skb, TIPC_DUMP_ALL,
|
|
|
|
"rcvq >90% allocated!");
|
2017-10-13 16:04:20 +07:00
|
|
|
sk->sk_data_ready(sk);
|
2015-07-22 21:11:20 +07:00
|
|
|
}
|
2008-04-15 14:22:02 +07:00
|
|
|
}
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2008-04-15 14:22:02 +07:00
|
|
|
/**
|
2017-10-13 16:04:20 +07:00
|
|
|
* tipc_sk_backlog_rcv - handle incoming message from backlog queue
|
2008-04-15 14:22:02 +07:00
|
|
|
* @sk: socket
|
2014-11-26 10:41:55 +07:00
|
|
|
* @skb: message
|
2008-04-15 14:22:02 +07:00
|
|
|
*
|
tipc: split up function tipc_msg_eval()
The function tipc_msg_eval() is in reality doing two related, but
different tasks. First it tries to find a new destination for named
messages, in case there was no first lookup, or if the first lookup
failed. Second, it does what its name suggests, evaluating the validity
of the message and its destination, and returning an appropriate error
code depending on the result.
This is confusing, and in this commit we choose to break it up into two
functions. A new function, tipc_msg_lookup_dest(), first attempts to find
a new destination, if the message is of the right type. If this lookup
fails, or if the message should not be subject to a second lookup, the
already existing tipc_msg_reverse() is called. This function performs
prepares the message for rejection, if applicable.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 20:36:39 +07:00
|
|
|
* Caller must hold socket lock
|
2008-04-15 14:22:02 +07:00
|
|
|
*/
|
2017-10-13 16:04:20 +07:00
|
|
|
static int tipc_sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
|
2008-04-15 14:22:02 +07:00
|
|
|
{
|
2017-10-13 16:04:20 +07:00
|
|
|
unsigned int before = sk_rmem_alloc_get(sk);
|
2016-06-17 17:35:57 +07:00
|
|
|
struct sk_buff_head xmitq;
|
2017-10-13 16:04:20 +07:00
|
|
|
unsigned int added;
|
2008-04-15 14:22:02 +07:00
|
|
|
|
2016-06-17 17:35:57 +07:00
|
|
|
__skb_queue_head_init(&xmitq);
|
|
|
|
|
2017-10-13 16:04:20 +07:00
|
|
|
tipc_sk_filter_rcv(sk, skb, &xmitq);
|
|
|
|
added = sk_rmem_alloc_get(sk) - before;
|
|
|
|
atomic_add(added, &tipc_sk(sk)->dupl_rcvcnt);
|
2016-06-17 17:35:57 +07:00
|
|
|
|
2017-10-13 16:04:20 +07:00
|
|
|
/* Send pending response/rejected messages, if any */
|
2017-10-13 16:04:21 +07:00
|
|
|
tipc_node_distr_xmit(sock_net(sk), &xmitq);
|
2008-04-15 14:22:02 +07:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2015-02-05 20:36:38 +07:00
|
|
|
/**
|
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 20:36:41 +07:00
|
|
|
* tipc_sk_enqueue - extract all buffers with destination 'dport' from
|
|
|
|
* inputq and try adding them to socket or backlog queue
|
|
|
|
* @inputq: list of incoming buffers with potentially different destinations
|
|
|
|
* @sk: socket where the buffers should be enqueued
|
|
|
|
* @dport: port number for the socket
|
2015-02-05 20:36:38 +07:00
|
|
|
*
|
|
|
|
* Caller must hold socket lock
|
|
|
|
*/
|
2015-07-22 21:11:20 +07:00
|
|
|
static void tipc_sk_enqueue(struct sk_buff_head *inputq, struct sock *sk,
|
2016-06-17 17:35:57 +07:00
|
|
|
u32 dport, struct sk_buff_head *xmitq)
|
2015-02-05 20:36:38 +07:00
|
|
|
{
|
2016-06-17 17:35:57 +07:00
|
|
|
unsigned long time_limit = jiffies + 2;
|
|
|
|
struct sk_buff *skb;
|
2015-02-05 20:36:38 +07:00
|
|
|
unsigned int lim;
|
|
|
|
atomic_t *dcnt;
|
2016-06-17 17:35:57 +07:00
|
|
|
u32 onode;
|
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 20:36:41 +07:00
|
|
|
|
|
|
|
while (skb_queue_len(inputq)) {
|
2015-02-08 23:10:50 +07:00
|
|
|
if (unlikely(time_after_eq(jiffies, time_limit)))
|
2015-07-22 21:11:20 +07:00
|
|
|
return;
|
|
|
|
|
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 20:36:41 +07:00
|
|
|
skb = tipc_skb_dequeue(inputq, dport);
|
|
|
|
if (unlikely(!skb))
|
2015-07-22 21:11:20 +07:00
|
|
|
return;
|
|
|
|
|
|
|
|
/* Add message directly to receive queue if possible */
|
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 20:36:41 +07:00
|
|
|
if (!sock_owned_by_user(sk)) {
|
2017-10-13 16:04:20 +07:00
|
|
|
tipc_sk_filter_rcv(sk, skb, xmitq);
|
2015-07-22 21:11:20 +07:00
|
|
|
continue;
|
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 20:36:41 +07:00
|
|
|
}
|
2015-07-22 21:11:20 +07:00
|
|
|
|
|
|
|
/* Try backlog, compensating for double-counted bytes */
|
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 20:36:41 +07:00
|
|
|
dcnt = &tipc_sk(sk)->dupl_rcvcnt;
|
2016-05-02 22:58:45 +07:00
|
|
|
if (!sk->sk_backlog.len)
|
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 20:36:41 +07:00
|
|
|
atomic_set(dcnt, 0);
|
|
|
|
lim = rcvbuf_limit(sk, skb) + atomic_read(dcnt);
|
tipc: add trace_events for tipc socket
The commit adds the new trace_events for TIPC socket object:
trace_tipc_sk_create()
trace_tipc_sk_poll()
trace_tipc_sk_sendmsg()
trace_tipc_sk_sendmcast()
trace_tipc_sk_sendstream()
trace_tipc_sk_filter_rcv()
trace_tipc_sk_advance_rx()
trace_tipc_sk_rej_msg()
trace_tipc_sk_drop_msg()
trace_tipc_sk_release()
trace_tipc_sk_shutdown()
trace_tipc_sk_overlimit1()
trace_tipc_sk_overlimit2()
Also, enables the traces for the following cases:
- When user creates a TIPC socket;
- When user calls poll() on TIPC socket;
- When user sends a dgram/mcast/stream message.
- When a message is put into the socket 'sk_receive_queue';
- When a message is released from the socket 'sk_receive_queue';
- When a message is rejected (e.g. due to no port, invalid, etc.);
- When a message is dropped (e.g. due to wrong message type);
- When socket is released;
- When socket is shutdown;
- When socket rcvq's allocation is overlimit (> 90%);
- When socket rcvq + bklq's allocation is overlimit (> 90%);
- When the 'TIPC_ERR_OVERLOAD/2' issue happens;
Note:
a) All the socket traces are designed to be able to trace on a specific
socket by either using the 'event filtering' feature on a known socket
'portid' value or the sysctl file:
/proc/sys/net/tipc/sk_filter
The file determines a 'tuple' for what socket should be traced:
(portid, sock type, name type, name lower, name upper)
where:
+ 'portid' is the socket portid generated at socket creating, can be
found in the trace outputs or the 'tipc socket list' command printouts;
+ 'sock type' is the socket type (1 = SOCK_TREAM, ...);
+ 'name type', 'name lower' and 'name upper' are the service name being
connected to or published by the socket.
Value '0' means 'ANY', the default tuple value is (0, 0, 0, 0, 0) i.e.
the traces happen for every sockets with no filter.
b) The 'tipc_sk_overlimit1/2' event is also a conditional trace_event
which happens when the socket receive queue (and backlog queue) is
about to be overloaded, when the queue allocation is > 90%. Then, when
the trace is enabled, the last skbs leading to the TIPC_ERR_OVERLOAD/2
issue can be traced.
The trace event is designed as an 'upper watermark' notification that
the other traces (e.g. 'tipc_sk_advance_rx' vs 'tipc_sk_filter_rcv') or
actions can be triggerred in the meanwhile to see what is going on with
the socket queue.
In addition, the 'trace_tipc_sk_dump()' is also placed at the
'TIPC_ERR_OVERLOAD/2' case, so the socket and last skb can be dumped
for post-analysis.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 09:17:58 +07:00
|
|
|
if (likely(!sk_add_backlog(sk, skb, lim))) {
|
|
|
|
trace_tipc_sk_overlimit1(sk, skb, TIPC_DUMP_ALL,
|
|
|
|
"bklg & rcvq >90% allocated!");
|
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 20:36:41 +07:00
|
|
|
continue;
|
tipc: add trace_events for tipc socket
The commit adds the new trace_events for TIPC socket object:
trace_tipc_sk_create()
trace_tipc_sk_poll()
trace_tipc_sk_sendmsg()
trace_tipc_sk_sendmcast()
trace_tipc_sk_sendstream()
trace_tipc_sk_filter_rcv()
trace_tipc_sk_advance_rx()
trace_tipc_sk_rej_msg()
trace_tipc_sk_drop_msg()
trace_tipc_sk_release()
trace_tipc_sk_shutdown()
trace_tipc_sk_overlimit1()
trace_tipc_sk_overlimit2()
Also, enables the traces for the following cases:
- When user creates a TIPC socket;
- When user calls poll() on TIPC socket;
- When user sends a dgram/mcast/stream message.
- When a message is put into the socket 'sk_receive_queue';
- When a message is released from the socket 'sk_receive_queue';
- When a message is rejected (e.g. due to no port, invalid, etc.);
- When a message is dropped (e.g. due to wrong message type);
- When socket is released;
- When socket is shutdown;
- When socket rcvq's allocation is overlimit (> 90%);
- When socket rcvq + bklq's allocation is overlimit (> 90%);
- When the 'TIPC_ERR_OVERLOAD/2' issue happens;
Note:
a) All the socket traces are designed to be able to trace on a specific
socket by either using the 'event filtering' feature on a known socket
'portid' value or the sysctl file:
/proc/sys/net/tipc/sk_filter
The file determines a 'tuple' for what socket should be traced:
(portid, sock type, name type, name lower, name upper)
where:
+ 'portid' is the socket portid generated at socket creating, can be
found in the trace outputs or the 'tipc socket list' command printouts;
+ 'sock type' is the socket type (1 = SOCK_TREAM, ...);
+ 'name type', 'name lower' and 'name upper' are the service name being
connected to or published by the socket.
Value '0' means 'ANY', the default tuple value is (0, 0, 0, 0, 0) i.e.
the traces happen for every sockets with no filter.
b) The 'tipc_sk_overlimit1/2' event is also a conditional trace_event
which happens when the socket receive queue (and backlog queue) is
about to be overloaded, when the queue allocation is > 90%. Then, when
the trace is enabled, the last skbs leading to the TIPC_ERR_OVERLOAD/2
issue can be traced.
The trace event is designed as an 'upper watermark' notification that
the other traces (e.g. 'tipc_sk_advance_rx' vs 'tipc_sk_filter_rcv') or
actions can be triggerred in the meanwhile to see what is going on with
the socket queue.
In addition, the 'trace_tipc_sk_dump()' is also placed at the
'TIPC_ERR_OVERLOAD/2' case, so the socket and last skb can be dumped
for post-analysis.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 09:17:58 +07:00
|
|
|
}
|
2015-07-22 21:11:20 +07:00
|
|
|
|
tipc: add trace_events for tipc socket
The commit adds the new trace_events for TIPC socket object:
trace_tipc_sk_create()
trace_tipc_sk_poll()
trace_tipc_sk_sendmsg()
trace_tipc_sk_sendmcast()
trace_tipc_sk_sendstream()
trace_tipc_sk_filter_rcv()
trace_tipc_sk_advance_rx()
trace_tipc_sk_rej_msg()
trace_tipc_sk_drop_msg()
trace_tipc_sk_release()
trace_tipc_sk_shutdown()
trace_tipc_sk_overlimit1()
trace_tipc_sk_overlimit2()
Also, enables the traces for the following cases:
- When user creates a TIPC socket;
- When user calls poll() on TIPC socket;
- When user sends a dgram/mcast/stream message.
- When a message is put into the socket 'sk_receive_queue';
- When a message is released from the socket 'sk_receive_queue';
- When a message is rejected (e.g. due to no port, invalid, etc.);
- When a message is dropped (e.g. due to wrong message type);
- When socket is released;
- When socket is shutdown;
- When socket rcvq's allocation is overlimit (> 90%);
- When socket rcvq + bklq's allocation is overlimit (> 90%);
- When the 'TIPC_ERR_OVERLOAD/2' issue happens;
Note:
a) All the socket traces are designed to be able to trace on a specific
socket by either using the 'event filtering' feature on a known socket
'portid' value or the sysctl file:
/proc/sys/net/tipc/sk_filter
The file determines a 'tuple' for what socket should be traced:
(portid, sock type, name type, name lower, name upper)
where:
+ 'portid' is the socket portid generated at socket creating, can be
found in the trace outputs or the 'tipc socket list' command printouts;
+ 'sock type' is the socket type (1 = SOCK_TREAM, ...);
+ 'name type', 'name lower' and 'name upper' are the service name being
connected to or published by the socket.
Value '0' means 'ANY', the default tuple value is (0, 0, 0, 0, 0) i.e.
the traces happen for every sockets with no filter.
b) The 'tipc_sk_overlimit1/2' event is also a conditional trace_event
which happens when the socket receive queue (and backlog queue) is
about to be overloaded, when the queue allocation is > 90%. Then, when
the trace is enabled, the last skbs leading to the TIPC_ERR_OVERLOAD/2
issue can be traced.
The trace event is designed as an 'upper watermark' notification that
the other traces (e.g. 'tipc_sk_advance_rx' vs 'tipc_sk_filter_rcv') or
actions can be triggerred in the meanwhile to see what is going on with
the socket queue.
In addition, the 'trace_tipc_sk_dump()' is also placed at the
'TIPC_ERR_OVERLOAD/2' case, so the socket and last skb can be dumped
for post-analysis.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 09:17:58 +07:00
|
|
|
trace_tipc_sk_dump(sk, skb, TIPC_DUMP_ALL, "err_overload!");
|
2015-07-22 21:11:20 +07:00
|
|
|
/* Overload => reject message back to sender */
|
2016-06-17 17:35:57 +07:00
|
|
|
onode = tipc_own_addr(sock_net(sk));
|
2018-03-21 20:37:45 +07:00
|
|
|
atomic_inc(&sk->sk_drops);
|
tipc: add trace_events for tipc socket
The commit adds the new trace_events for TIPC socket object:
trace_tipc_sk_create()
trace_tipc_sk_poll()
trace_tipc_sk_sendmsg()
trace_tipc_sk_sendmcast()
trace_tipc_sk_sendstream()
trace_tipc_sk_filter_rcv()
trace_tipc_sk_advance_rx()
trace_tipc_sk_rej_msg()
trace_tipc_sk_drop_msg()
trace_tipc_sk_release()
trace_tipc_sk_shutdown()
trace_tipc_sk_overlimit1()
trace_tipc_sk_overlimit2()
Also, enables the traces for the following cases:
- When user creates a TIPC socket;
- When user calls poll() on TIPC socket;
- When user sends a dgram/mcast/stream message.
- When a message is put into the socket 'sk_receive_queue';
- When a message is released from the socket 'sk_receive_queue';
- When a message is rejected (e.g. due to no port, invalid, etc.);
- When a message is dropped (e.g. due to wrong message type);
- When socket is released;
- When socket is shutdown;
- When socket rcvq's allocation is overlimit (> 90%);
- When socket rcvq + bklq's allocation is overlimit (> 90%);
- When the 'TIPC_ERR_OVERLOAD/2' issue happens;
Note:
a) All the socket traces are designed to be able to trace on a specific
socket by either using the 'event filtering' feature on a known socket
'portid' value or the sysctl file:
/proc/sys/net/tipc/sk_filter
The file determines a 'tuple' for what socket should be traced:
(portid, sock type, name type, name lower, name upper)
where:
+ 'portid' is the socket portid generated at socket creating, can be
found in the trace outputs or the 'tipc socket list' command printouts;
+ 'sock type' is the socket type (1 = SOCK_TREAM, ...);
+ 'name type', 'name lower' and 'name upper' are the service name being
connected to or published by the socket.
Value '0' means 'ANY', the default tuple value is (0, 0, 0, 0, 0) i.e.
the traces happen for every sockets with no filter.
b) The 'tipc_sk_overlimit1/2' event is also a conditional trace_event
which happens when the socket receive queue (and backlog queue) is
about to be overloaded, when the queue allocation is > 90%. Then, when
the trace is enabled, the last skbs leading to the TIPC_ERR_OVERLOAD/2
issue can be traced.
The trace event is designed as an 'upper watermark' notification that
the other traces (e.g. 'tipc_sk_advance_rx' vs 'tipc_sk_filter_rcv') or
actions can be triggerred in the meanwhile to see what is going on with
the socket queue.
In addition, the 'trace_tipc_sk_dump()' is also placed at the
'TIPC_ERR_OVERLOAD/2' case, so the socket and last skb can be dumped
for post-analysis.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 09:17:58 +07:00
|
|
|
if (tipc_msg_reverse(onode, &skb, TIPC_ERR_OVERLOAD)) {
|
|
|
|
trace_tipc_sk_rej_msg(sk, skb, TIPC_DUMP_ALL,
|
|
|
|
"@sk_enqueue!");
|
2016-06-17 17:35:57 +07:00
|
|
|
__skb_queue_tail(xmitq, skb);
|
tipc: add trace_events for tipc socket
The commit adds the new trace_events for TIPC socket object:
trace_tipc_sk_create()
trace_tipc_sk_poll()
trace_tipc_sk_sendmsg()
trace_tipc_sk_sendmcast()
trace_tipc_sk_sendstream()
trace_tipc_sk_filter_rcv()
trace_tipc_sk_advance_rx()
trace_tipc_sk_rej_msg()
trace_tipc_sk_drop_msg()
trace_tipc_sk_release()
trace_tipc_sk_shutdown()
trace_tipc_sk_overlimit1()
trace_tipc_sk_overlimit2()
Also, enables the traces for the following cases:
- When user creates a TIPC socket;
- When user calls poll() on TIPC socket;
- When user sends a dgram/mcast/stream message.
- When a message is put into the socket 'sk_receive_queue';
- When a message is released from the socket 'sk_receive_queue';
- When a message is rejected (e.g. due to no port, invalid, etc.);
- When a message is dropped (e.g. due to wrong message type);
- When socket is released;
- When socket is shutdown;
- When socket rcvq's allocation is overlimit (> 90%);
- When socket rcvq + bklq's allocation is overlimit (> 90%);
- When the 'TIPC_ERR_OVERLOAD/2' issue happens;
Note:
a) All the socket traces are designed to be able to trace on a specific
socket by either using the 'event filtering' feature on a known socket
'portid' value or the sysctl file:
/proc/sys/net/tipc/sk_filter
The file determines a 'tuple' for what socket should be traced:
(portid, sock type, name type, name lower, name upper)
where:
+ 'portid' is the socket portid generated at socket creating, can be
found in the trace outputs or the 'tipc socket list' command printouts;
+ 'sock type' is the socket type (1 = SOCK_TREAM, ...);
+ 'name type', 'name lower' and 'name upper' are the service name being
connected to or published by the socket.
Value '0' means 'ANY', the default tuple value is (0, 0, 0, 0, 0) i.e.
the traces happen for every sockets with no filter.
b) The 'tipc_sk_overlimit1/2' event is also a conditional trace_event
which happens when the socket receive queue (and backlog queue) is
about to be overloaded, when the queue allocation is > 90%. Then, when
the trace is enabled, the last skbs leading to the TIPC_ERR_OVERLOAD/2
issue can be traced.
The trace event is designed as an 'upper watermark' notification that
the other traces (e.g. 'tipc_sk_advance_rx' vs 'tipc_sk_filter_rcv') or
actions can be triggerred in the meanwhile to see what is going on with
the socket queue.
In addition, the 'trace_tipc_sk_dump()' is also placed at the
'TIPC_ERR_OVERLOAD/2' case, so the socket and last skb can be dumped
for post-analysis.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 09:17:58 +07:00
|
|
|
}
|
2015-07-22 21:11:20 +07:00
|
|
|
break;
|
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 20:36:41 +07:00
|
|
|
}
|
2015-02-05 20:36:38 +07:00
|
|
|
}
|
|
|
|
|
2008-04-15 14:22:02 +07:00
|
|
|
/**
|
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 20:36:41 +07:00
|
|
|
* tipc_sk_rcv - handle a chain of incoming buffers
|
|
|
|
* @inputq: buffer list containing the buffers
|
|
|
|
* Consumes all buffers in list until inputq is empty
|
|
|
|
* Note: may be called in multiple threads referring to the same queue
|
2008-04-15 14:22:02 +07:00
|
|
|
*/
|
2015-07-22 21:11:20 +07:00
|
|
|
void tipc_sk_rcv(struct net *net, struct sk_buff_head *inputq)
|
2008-04-15 14:22:02 +07:00
|
|
|
{
|
2016-06-17 17:35:57 +07:00
|
|
|
struct sk_buff_head xmitq;
|
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 20:36:41 +07:00
|
|
|
u32 dnode, dport = 0;
|
2015-04-23 20:37:39 +07:00
|
|
|
int err;
|
2014-05-14 16:39:15 +07:00
|
|
|
struct tipc_sock *tsk;
|
|
|
|
struct sock *sk;
|
2015-07-22 21:11:20 +07:00
|
|
|
struct sk_buff *skb;
|
2014-05-14 16:39:15 +07:00
|
|
|
|
2016-06-17 17:35:57 +07:00
|
|
|
__skb_queue_head_init(&xmitq);
|
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 20:36:41 +07:00
|
|
|
while (skb_queue_len(inputq)) {
|
|
|
|
dport = tipc_skb_peek_port(inputq, dport);
|
|
|
|
tsk = tipc_sk_lookup(net, dport);
|
2015-07-22 21:11:20 +07:00
|
|
|
|
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 20:36:41 +07:00
|
|
|
if (likely(tsk)) {
|
|
|
|
sk = &tsk->sk;
|
|
|
|
if (likely(spin_trylock_bh(&sk->sk_lock.slock))) {
|
2016-06-17 17:35:57 +07:00
|
|
|
tipc_sk_enqueue(inputq, sk, dport, &xmitq);
|
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 20:36:41 +07:00
|
|
|
spin_unlock_bh(&sk->sk_lock.slock);
|
|
|
|
}
|
2016-06-17 17:35:57 +07:00
|
|
|
/* Send pending response/rejected messages, if any */
|
2017-10-13 16:04:21 +07:00
|
|
|
tipc_node_distr_xmit(sock_net(sk), &xmitq);
|
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 20:36:41 +07:00
|
|
|
sock_put(sk);
|
|
|
|
continue;
|
|
|
|
}
|
2015-07-22 21:11:20 +07:00
|
|
|
/* No destination socket => dequeue skb if still there */
|
|
|
|
skb = tipc_skb_dequeue(inputq, dport);
|
|
|
|
if (!skb)
|
|
|
|
return;
|
|
|
|
|
|
|
|
/* Try secondary lookup if unresolved named message */
|
|
|
|
err = TIPC_ERR_NO_PORT;
|
|
|
|
if (tipc_msg_lookup_dest(net, skb, &err))
|
|
|
|
goto xmit;
|
|
|
|
|
|
|
|
/* Prepare for message rejection */
|
|
|
|
if (!tipc_msg_reverse(tipc_own_addr(net), &skb, err))
|
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 20:36:41 +07:00
|
|
|
continue;
|
tipc: add trace_events for tipc socket
The commit adds the new trace_events for TIPC socket object:
trace_tipc_sk_create()
trace_tipc_sk_poll()
trace_tipc_sk_sendmsg()
trace_tipc_sk_sendmcast()
trace_tipc_sk_sendstream()
trace_tipc_sk_filter_rcv()
trace_tipc_sk_advance_rx()
trace_tipc_sk_rej_msg()
trace_tipc_sk_drop_msg()
trace_tipc_sk_release()
trace_tipc_sk_shutdown()
trace_tipc_sk_overlimit1()
trace_tipc_sk_overlimit2()
Also, enables the traces for the following cases:
- When user creates a TIPC socket;
- When user calls poll() on TIPC socket;
- When user sends a dgram/mcast/stream message.
- When a message is put into the socket 'sk_receive_queue';
- When a message is released from the socket 'sk_receive_queue';
- When a message is rejected (e.g. due to no port, invalid, etc.);
- When a message is dropped (e.g. due to wrong message type);
- When socket is released;
- When socket is shutdown;
- When socket rcvq's allocation is overlimit (> 90%);
- When socket rcvq + bklq's allocation is overlimit (> 90%);
- When the 'TIPC_ERR_OVERLOAD/2' issue happens;
Note:
a) All the socket traces are designed to be able to trace on a specific
socket by either using the 'event filtering' feature on a known socket
'portid' value or the sysctl file:
/proc/sys/net/tipc/sk_filter
The file determines a 'tuple' for what socket should be traced:
(portid, sock type, name type, name lower, name upper)
where:
+ 'portid' is the socket portid generated at socket creating, can be
found in the trace outputs or the 'tipc socket list' command printouts;
+ 'sock type' is the socket type (1 = SOCK_TREAM, ...);
+ 'name type', 'name lower' and 'name upper' are the service name being
connected to or published by the socket.
Value '0' means 'ANY', the default tuple value is (0, 0, 0, 0, 0) i.e.
the traces happen for every sockets with no filter.
b) The 'tipc_sk_overlimit1/2' event is also a conditional trace_event
which happens when the socket receive queue (and backlog queue) is
about to be overloaded, when the queue allocation is > 90%. Then, when
the trace is enabled, the last skbs leading to the TIPC_ERR_OVERLOAD/2
issue can be traced.
The trace event is designed as an 'upper watermark' notification that
the other traces (e.g. 'tipc_sk_advance_rx' vs 'tipc_sk_filter_rcv') or
actions can be triggerred in the meanwhile to see what is going on with
the socket queue.
In addition, the 'trace_tipc_sk_dump()' is also placed at the
'TIPC_ERR_OVERLOAD/2' case, so the socket and last skb can be dumped
for post-analysis.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 09:17:58 +07:00
|
|
|
|
|
|
|
trace_tipc_sk_rej_msg(NULL, skb, TIPC_DUMP_NONE, "@sk_rcv!");
|
tipc: split up function tipc_msg_eval()
The function tipc_msg_eval() is in reality doing two related, but
different tasks. First it tries to find a new destination for named
messages, in case there was no first lookup, or if the first lookup
failed. Second, it does what its name suggests, evaluating the validity
of the message and its destination, and returning an appropriate error
code depending on the result.
This is confusing, and in this commit we choose to break it up into two
functions. A new function, tipc_msg_lookup_dest(), first attempts to find
a new destination, if the message is of the right type. If this lookup
fails, or if the message should not be subject to a second lookup, the
already existing tipc_msg_reverse() is called. This function performs
prepares the message for rejection, if applicable.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 20:36:39 +07:00
|
|
|
xmit:
|
2015-07-22 21:11:20 +07:00
|
|
|
dnode = msg_destnode(buf_msg(skb));
|
2015-07-17 03:54:24 +07:00
|
|
|
tipc_node_xmit_skb(net, skb, dnode, dport);
|
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 20:36:41 +07:00
|
|
|
}
|
2006-01-03 01:04:38 +07:00
|
|
|
}
|
|
|
|
|
2014-01-17 08:50:03 +07:00
|
|
|
static int tipc_wait_for_connect(struct socket *sock, long *timeo_p)
|
|
|
|
{
|
2016-11-12 01:20:50 +07:00
|
|
|
DEFINE_WAIT_FUNC(wait, woken_wake_function);
|
2014-01-17 08:50:03 +07:00
|
|
|
struct sock *sk = sock->sk;
|
|
|
|
int done;
|
|
|
|
|
|
|
|
do {
|
|
|
|
int err = sock_error(sk);
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
if (!*timeo_p)
|
|
|
|
return -ETIMEDOUT;
|
|
|
|
if (signal_pending(current))
|
|
|
|
return sock_intr_errno(*timeo_p);
|
2020-02-10 15:35:44 +07:00
|
|
|
if (sk->sk_state == TIPC_DISCONNECTING)
|
|
|
|
break;
|
2014-01-17 08:50:03 +07:00
|
|
|
|
2016-11-12 01:20:50 +07:00
|
|
|
add_wait_queue(sk_sleep(sk), &wait);
|
2020-01-08 09:19:00 +07:00
|
|
|
done = sk_wait_event(sk, timeo_p, tipc_sk_connected(sk),
|
|
|
|
&wait);
|
2016-11-12 01:20:50 +07:00
|
|
|
remove_wait_queue(sk_sleep(sk), &wait);
|
2014-01-17 08:50:03 +07:00
|
|
|
} while (!done);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2019-03-18 00:46:42 +07:00
|
|
|
static bool tipc_sockaddr_is_sane(struct sockaddr_tipc *addr)
|
|
|
|
{
|
|
|
|
if (addr->family != AF_TIPC)
|
|
|
|
return false;
|
|
|
|
if (addr->addrtype == TIPC_SERVICE_RANGE)
|
|
|
|
return (addr->addr.nameseq.lower <= addr->addr.nameseq.upper);
|
|
|
|
return (addr->addrtype == TIPC_SERVICE_ADDR ||
|
|
|
|
addr->addrtype == TIPC_SOCKET_ADDR);
|
|
|
|
}
|
|
|
|
|
2006-01-03 01:04:38 +07:00
|
|
|
/**
|
2014-02-18 15:06:46 +07:00
|
|
|
* tipc_connect - establish a connection to another TIPC port
|
2006-01-03 01:04:38 +07:00
|
|
|
* @sock: socket structure
|
|
|
|
* @dest: socket address for destination port
|
|
|
|
* @destlen: size of socket address data structure
|
2008-04-15 14:22:02 +07:00
|
|
|
* @flags: file-related flags associated with socket
|
2006-01-03 01:04:38 +07:00
|
|
|
*
|
|
|
|
* Returns 0 on success, errno otherwise
|
|
|
|
*/
|
2014-02-18 15:06:46 +07:00
|
|
|
static int tipc_connect(struct socket *sock, struct sockaddr *dest,
|
|
|
|
int destlen, int flags)
|
2006-01-03 01:04:38 +07:00
|
|
|
{
|
2008-04-15 14:22:02 +07:00
|
|
|
struct sock *sk = sock->sk;
|
2015-03-19 15:02:19 +07:00
|
|
|
struct tipc_sock *tsk = tipc_sk(sk);
|
2008-04-15 14:20:37 +07:00
|
|
|
struct sockaddr_tipc *dst = (struct sockaddr_tipc *)dest;
|
|
|
|
struct msghdr m = {NULL,};
|
2015-03-19 15:02:19 +07:00
|
|
|
long timeout = (flags & O_NONBLOCK) ? 0 : tsk->conn_timeout;
|
2016-11-01 20:02:48 +07:00
|
|
|
int previous;
|
2015-03-19 15:02:19 +07:00
|
|
|
int res = 0;
|
2008-04-15 14:20:37 +07:00
|
|
|
|
2017-10-13 16:04:18 +07:00
|
|
|
if (destlen != sizeof(struct sockaddr_tipc))
|
|
|
|
return -EINVAL;
|
|
|
|
|
2008-04-15 14:22:02 +07:00
|
|
|
lock_sock(sk);
|
|
|
|
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
if (tsk->group) {
|
|
|
|
res = -EINVAL;
|
|
|
|
goto exit;
|
|
|
|
}
|
|
|
|
|
2017-10-13 16:04:18 +07:00
|
|
|
if (dst->family == AF_UNSPEC) {
|
|
|
|
memset(&tsk->peer, 0, sizeof(struct sockaddr_tipc));
|
|
|
|
if (!tipc_sk_type_connectionless(sk))
|
2015-03-24 02:30:00 +07:00
|
|
|
res = -EINVAL;
|
2008-04-15 14:22:02 +07:00
|
|
|
goto exit;
|
|
|
|
}
|
2019-03-18 00:46:42 +07:00
|
|
|
if (!tipc_sockaddr_is_sane(dst)) {
|
2008-04-15 14:22:02 +07:00
|
|
|
res = -EINVAL;
|
2017-10-13 16:04:18 +07:00
|
|
|
goto exit;
|
2019-03-18 00:46:42 +07:00
|
|
|
}
|
2017-10-13 16:04:18 +07:00
|
|
|
/* DGRAM/RDM connect(), just save the destaddr */
|
|
|
|
if (tipc_sk_type_connectionless(sk)) {
|
|
|
|
memcpy(&tsk->peer, dest, destlen);
|
2008-04-15 14:22:02 +07:00
|
|
|
goto exit;
|
2019-03-18 00:46:42 +07:00
|
|
|
} else if (dst->addrtype == TIPC_SERVICE_RANGE) {
|
|
|
|
res = -EINVAL;
|
|
|
|
goto exit;
|
2008-04-15 14:22:02 +07:00
|
|
|
}
|
|
|
|
|
2016-11-01 20:02:48 +07:00
|
|
|
previous = sk->sk_state;
|
2016-11-01 20:02:45 +07:00
|
|
|
|
|
|
|
switch (sk->sk_state) {
|
|
|
|
case TIPC_OPEN:
|
tipc: introduce non-blocking socket connect
TIPC has so far only supported blocking connect(), meaning that a call
to connect() doesn't return until either the connection is fully
established, or an error occurs. This has proved insufficient for many
users, so we now introduce non-blocking connect(), analogous to how
this is done in TCP and other protocols.
With this feature, if a connection cannot be established instantly,
connect() will return the error code "-EINPROGRESS".
If the user later calls connect() again, he will either have the
return code "-EALREADY" or "-EISCONN", depending on whether the
connection has been established or not.
The user must have explicitly set the socket to be non-blocking
(SOCK_NONBLOCK or O_NONBLOCK, depending on method used), so unless
for some reason they had set this already (the socket would anyway
remain blocking in current TIPC) this change should be completely
backwards compatible.
It is also now possible to call select() or poll() to wait for the
completion of a connection.
An effect of the above is that the actual completion of a connection
may now be performed asynchronously, independent of the calls from
user space. Therefore, we now execute this code in BH context, in
the function filter_rcv(), which is executed upon reception of
messages in the socket.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
[PG: minor refactoring for improved connect/disconnect function names]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2012-11-30 06:51:19 +07:00
|
|
|
/* Send a 'SYN-' to destination */
|
|
|
|
m.msg_name = dest;
|
|
|
|
m.msg_namelen = destlen;
|
|
|
|
|
|
|
|
/* If connect is in non-blocking case, set MSG_DONTWAIT to
|
|
|
|
* indicate send_msg() is never blocked.
|
|
|
|
*/
|
|
|
|
if (!timeout)
|
|
|
|
m.msg_flags = MSG_DONTWAIT;
|
|
|
|
|
2015-03-02 14:37:47 +07:00
|
|
|
res = __tipc_sendmsg(sock, &m, 0);
|
tipc: introduce non-blocking socket connect
TIPC has so far only supported blocking connect(), meaning that a call
to connect() doesn't return until either the connection is fully
established, or an error occurs. This has proved insufficient for many
users, so we now introduce non-blocking connect(), analogous to how
this is done in TCP and other protocols.
With this feature, if a connection cannot be established instantly,
connect() will return the error code "-EINPROGRESS".
If the user later calls connect() again, he will either have the
return code "-EALREADY" or "-EISCONN", depending on whether the
connection has been established or not.
The user must have explicitly set the socket to be non-blocking
(SOCK_NONBLOCK or O_NONBLOCK, depending on method used), so unless
for some reason they had set this already (the socket would anyway
remain blocking in current TIPC) this change should be completely
backwards compatible.
It is also now possible to call select() or poll() to wait for the
completion of a connection.
An effect of the above is that the actual completion of a connection
may now be performed asynchronously, independent of the calls from
user space. Therefore, we now execute this code in BH context, in
the function filter_rcv(), which is executed upon reception of
messages in the socket.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
[PG: minor refactoring for improved connect/disconnect function names]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2012-11-30 06:51:19 +07:00
|
|
|
if ((res < 0) && (res != -EWOULDBLOCK))
|
|
|
|
goto exit;
|
|
|
|
|
2016-11-01 20:02:48 +07:00
|
|
|
/* Just entered TIPC_CONNECTING state; the only
|
tipc: introduce non-blocking socket connect
TIPC has so far only supported blocking connect(), meaning that a call
to connect() doesn't return until either the connection is fully
established, or an error occurs. This has proved insufficient for many
users, so we now introduce non-blocking connect(), analogous to how
this is done in TCP and other protocols.
With this feature, if a connection cannot be established instantly,
connect() will return the error code "-EINPROGRESS".
If the user later calls connect() again, he will either have the
return code "-EALREADY" or "-EISCONN", depending on whether the
connection has been established or not.
The user must have explicitly set the socket to be non-blocking
(SOCK_NONBLOCK or O_NONBLOCK, depending on method used), so unless
for some reason they had set this already (the socket would anyway
remain blocking in current TIPC) this change should be completely
backwards compatible.
It is also now possible to call select() or poll() to wait for the
completion of a connection.
An effect of the above is that the actual completion of a connection
may now be performed asynchronously, independent of the calls from
user space. Therefore, we now execute this code in BH context, in
the function filter_rcv(), which is executed upon reception of
messages in the socket.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
[PG: minor refactoring for improved connect/disconnect function names]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2012-11-30 06:51:19 +07:00
|
|
|
* difference is that return value in non-blocking
|
|
|
|
* case is EINPROGRESS, rather than EALREADY.
|
|
|
|
*/
|
|
|
|
res = -EINPROGRESS;
|
2020-08-24 05:36:59 +07:00
|
|
|
fallthrough;
|
2016-11-01 20:02:48 +07:00
|
|
|
case TIPC_CONNECTING:
|
|
|
|
if (!timeout) {
|
|
|
|
if (previous == TIPC_CONNECTING)
|
|
|
|
res = -EALREADY;
|
2014-01-17 08:50:03 +07:00
|
|
|
goto exit;
|
2016-11-01 20:02:48 +07:00
|
|
|
}
|
2014-01-17 08:50:03 +07:00
|
|
|
timeout = msecs_to_jiffies(timeout);
|
|
|
|
/* Wait until an 'ACK' or 'RST' arrives, or a timeout occurs */
|
|
|
|
res = tipc_wait_for_connect(sock, &timeout);
|
2016-11-01 20:02:49 +07:00
|
|
|
break;
|
|
|
|
case TIPC_ESTABLISHED:
|
tipc: introduce non-blocking socket connect
TIPC has so far only supported blocking connect(), meaning that a call
to connect() doesn't return until either the connection is fully
established, or an error occurs. This has proved insufficient for many
users, so we now introduce non-blocking connect(), analogous to how
this is done in TCP and other protocols.
With this feature, if a connection cannot be established instantly,
connect() will return the error code "-EINPROGRESS".
If the user later calls connect() again, he will either have the
return code "-EALREADY" or "-EISCONN", depending on whether the
connection has been established or not.
The user must have explicitly set the socket to be non-blocking
(SOCK_NONBLOCK or O_NONBLOCK, depending on method used), so unless
for some reason they had set this already (the socket would anyway
remain blocking in current TIPC) this change should be completely
backwards compatible.
It is also now possible to call select() or poll() to wait for the
completion of a connection.
An effect of the above is that the actual completion of a connection
may now be performed asynchronously, independent of the calls from
user space. Therefore, we now execute this code in BH context, in
the function filter_rcv(), which is executed upon reception of
messages in the socket.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
[PG: minor refactoring for improved connect/disconnect function names]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2012-11-30 06:51:19 +07:00
|
|
|
res = -EISCONN;
|
2016-11-01 20:02:49 +07:00
|
|
|
break;
|
|
|
|
default:
|
tipc: introduce non-blocking socket connect
TIPC has so far only supported blocking connect(), meaning that a call
to connect() doesn't return until either the connection is fully
established, or an error occurs. This has proved insufficient for many
users, so we now introduce non-blocking connect(), analogous to how
this is done in TCP and other protocols.
With this feature, if a connection cannot be established instantly,
connect() will return the error code "-EINPROGRESS".
If the user later calls connect() again, he will either have the
return code "-EALREADY" or "-EISCONN", depending on whether the
connection has been established or not.
The user must have explicitly set the socket to be non-blocking
(SOCK_NONBLOCK or O_NONBLOCK, depending on method used), so unless
for some reason they had set this already (the socket would anyway
remain blocking in current TIPC) this change should be completely
backwards compatible.
It is also now possible to call select() or poll() to wait for the
completion of a connection.
An effect of the above is that the actual completion of a connection
may now be performed asynchronously, independent of the calls from
user space. Therefore, we now execute this code in BH context, in
the function filter_rcv(), which is executed upon reception of
messages in the socket.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
[PG: minor refactoring for improved connect/disconnect function names]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2012-11-30 06:51:19 +07:00
|
|
|
res = -EINVAL;
|
2016-11-01 20:02:49 +07:00
|
|
|
}
|
2016-11-01 20:02:48 +07:00
|
|
|
|
2008-04-15 14:22:02 +07:00
|
|
|
exit:
|
|
|
|
release_sock(sk);
|
2008-04-15 14:20:37 +07:00
|
|
|
return res;
|
2006-01-03 01:04:38 +07:00
|
|
|
}
|
|
|
|
|
2007-02-09 21:25:21 +07:00
|
|
|
/**
|
2014-02-18 15:06:46 +07:00
|
|
|
* tipc_listen - allow socket to listen for incoming connections
|
2006-01-03 01:04:38 +07:00
|
|
|
* @sock: socket structure
|
|
|
|
* @len: (unused)
|
2007-02-09 21:25:21 +07:00
|
|
|
*
|
2006-01-03 01:04:38 +07:00
|
|
|
* Returns 0 on success, errno otherwise
|
|
|
|
*/
|
2014-02-18 15:06:46 +07:00
|
|
|
static int tipc_listen(struct socket *sock, int len)
|
2006-01-03 01:04:38 +07:00
|
|
|
{
|
2008-04-15 14:22:02 +07:00
|
|
|
struct sock *sk = sock->sk;
|
|
|
|
int res;
|
|
|
|
|
|
|
|
lock_sock(sk);
|
2016-11-01 20:02:43 +07:00
|
|
|
res = tipc_set_sk_state(sk, TIPC_LISTEN);
|
2008-04-15 14:22:02 +07:00
|
|
|
release_sock(sk);
|
2016-11-01 20:02:43 +07:00
|
|
|
|
2008-04-15 14:22:02 +07:00
|
|
|
return res;
|
2006-01-03 01:04:38 +07:00
|
|
|
}
|
|
|
|
|
2014-01-17 08:50:04 +07:00
|
|
|
static int tipc_wait_for_accept(struct socket *sock, long timeo)
|
|
|
|
{
|
|
|
|
struct sock *sk = sock->sk;
|
|
|
|
DEFINE_WAIT(wait);
|
|
|
|
int err;
|
|
|
|
|
|
|
|
/* True wake-one mechanism for incoming connections: only
|
|
|
|
* one process gets woken up, not the 'whole herd'.
|
|
|
|
* Since we do not 'race & poll' for established sockets
|
|
|
|
* anymore, the common case will execute the loop only once.
|
|
|
|
*/
|
|
|
|
for (;;) {
|
|
|
|
prepare_to_wait_exclusive(sk_sleep(sk), &wait,
|
|
|
|
TASK_INTERRUPTIBLE);
|
2014-03-06 20:40:18 +07:00
|
|
|
if (timeo && skb_queue_empty(&sk->sk_receive_queue)) {
|
2014-01-17 08:50:04 +07:00
|
|
|
release_sock(sk);
|
|
|
|
timeo = schedule_timeout(timeo);
|
|
|
|
lock_sock(sk);
|
|
|
|
}
|
|
|
|
err = 0;
|
|
|
|
if (!skb_queue_empty(&sk->sk_receive_queue))
|
|
|
|
break;
|
|
|
|
err = -EAGAIN;
|
|
|
|
if (!timeo)
|
|
|
|
break;
|
2015-03-09 16:43:42 +07:00
|
|
|
err = sock_intr_errno(timeo);
|
|
|
|
if (signal_pending(current))
|
|
|
|
break;
|
2014-01-17 08:50:04 +07:00
|
|
|
}
|
|
|
|
finish_wait(sk_sleep(sk), &wait);
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2007-02-09 21:25:21 +07:00
|
|
|
/**
|
2014-02-18 15:06:46 +07:00
|
|
|
* tipc_accept - wait for connection request
|
2006-01-03 01:04:38 +07:00
|
|
|
* @sock: listening socket
|
2020-07-13 06:15:14 +07:00
|
|
|
* @new_sock: new socket that is to be connected
|
2006-01-03 01:04:38 +07:00
|
|
|
* @flags: file-related flags associated with socket
|
2007-02-09 21:25:21 +07:00
|
|
|
*
|
2006-01-03 01:04:38 +07:00
|
|
|
* Returns 0 on success, errno otherwise
|
|
|
|
*/
|
2017-03-09 15:09:05 +07:00
|
|
|
static int tipc_accept(struct socket *sock, struct socket *new_sock, int flags,
|
|
|
|
bool kern)
|
2006-01-03 01:04:38 +07:00
|
|
|
{
|
2012-12-04 23:01:55 +07:00
|
|
|
struct sock *new_sk, *sk = sock->sk;
|
2006-01-03 01:04:38 +07:00
|
|
|
struct sk_buff *buf;
|
2014-08-23 05:09:20 +07:00
|
|
|
struct tipc_sock *new_tsock;
|
2012-12-04 23:01:55 +07:00
|
|
|
struct tipc_msg *msg;
|
2014-01-17 08:50:04 +07:00
|
|
|
long timeo;
|
2008-04-15 14:22:02 +07:00
|
|
|
int res;
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2008-04-15 14:22:02 +07:00
|
|
|
lock_sock(sk);
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2016-11-01 20:02:43 +07:00
|
|
|
if (sk->sk_state != TIPC_LISTEN) {
|
2008-04-15 14:22:02 +07:00
|
|
|
res = -EINVAL;
|
2006-01-03 01:04:38 +07:00
|
|
|
goto exit;
|
|
|
|
}
|
2014-01-17 08:50:04 +07:00
|
|
|
timeo = sock_rcvtimeo(sk, flags & O_NONBLOCK);
|
|
|
|
res = tipc_wait_for_accept(sock, timeo);
|
|
|
|
if (res)
|
|
|
|
goto exit;
|
2008-04-15 14:22:02 +07:00
|
|
|
|
|
|
|
buf = skb_peek(&sk->sk_receive_queue);
|
|
|
|
|
2017-03-09 15:09:05 +07:00
|
|
|
res = tipc_sk_create(sock_net(sock->sk), new_sock, 0, kern);
|
2012-12-04 23:01:55 +07:00
|
|
|
if (res)
|
|
|
|
goto exit;
|
2015-07-07 20:43:45 +07:00
|
|
|
security_sk_clone(sock->sk, new_sock->sk);
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2012-12-04 23:01:55 +07:00
|
|
|
new_sk = new_sock->sk;
|
2014-08-23 05:09:20 +07:00
|
|
|
new_tsock = tipc_sk(new_sk);
|
2012-12-04 23:01:55 +07:00
|
|
|
msg = buf_msg(buf);
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2012-12-04 23:01:55 +07:00
|
|
|
/* we lock on new_sk; but lockdep sees the lock on sk */
|
|
|
|
lock_sock_nested(new_sk, SINGLE_DEPTH_NESTING);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Reject any stray messages received by new socket
|
|
|
|
* before the socket lock was taken (very, very unlikely)
|
|
|
|
*/
|
2020-01-08 09:18:15 +07:00
|
|
|
tsk_rej_rx_queue(new_sk, TIPC_ERR_NO_PORT);
|
2012-12-04 23:01:55 +07:00
|
|
|
|
|
|
|
/* Connect new socket to it's peer */
|
2014-08-23 05:09:20 +07:00
|
|
|
tipc_sk_finish_conn(new_tsock, msg_origport(msg), msg_orignode(msg));
|
2012-12-04 23:01:55 +07:00
|
|
|
|
2020-05-28 12:12:36 +07:00
|
|
|
tsk_set_importance(new_sk, msg_importance(msg));
|
2012-12-04 23:01:55 +07:00
|
|
|
if (msg_named(msg)) {
|
2014-08-23 05:09:20 +07:00
|
|
|
new_tsock->conn_type = msg_nametype(msg);
|
|
|
|
new_tsock->conn_instance = msg_nameinst(msg);
|
2006-01-03 01:04:38 +07:00
|
|
|
}
|
2012-12-04 23:01:55 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Respond to 'SYN-' by discarding it & returning 'ACK'-.
|
|
|
|
* Respond to 'SYN+' by queuing it on new socket.
|
|
|
|
*/
|
|
|
|
if (!msg_data_sz(msg)) {
|
|
|
|
struct msghdr m = {NULL,};
|
|
|
|
|
2014-08-23 05:09:18 +07:00
|
|
|
tsk_advance_rx_queue(sk);
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
__tipc_sendstream(new_sock, &m, 0);
|
2012-12-04 23:01:55 +07:00
|
|
|
} else {
|
|
|
|
__skb_dequeue(&sk->sk_receive_queue);
|
|
|
|
__skb_queue_head(&new_sk->sk_receive_queue, buf);
|
2013-01-21 05:30:09 +07:00
|
|
|
skb_set_owner_r(buf, new_sk);
|
2012-12-04 23:01:55 +07:00
|
|
|
}
|
|
|
|
release_sock(new_sk);
|
2006-01-03 01:04:38 +07:00
|
|
|
exit:
|
2008-04-15 14:22:02 +07:00
|
|
|
release_sock(sk);
|
2006-01-03 01:04:38 +07:00
|
|
|
return res;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
2014-02-18 15:06:46 +07:00
|
|
|
* tipc_shutdown - shutdown socket connection
|
2006-01-03 01:04:38 +07:00
|
|
|
* @sock: socket structure
|
2008-03-07 06:05:38 +07:00
|
|
|
* @how: direction to close (must be SHUT_RDWR)
|
2006-01-03 01:04:38 +07:00
|
|
|
*
|
|
|
|
* Terminates connection (if necessary), then purges socket's receive queue.
|
2007-02-09 21:25:21 +07:00
|
|
|
*
|
2006-01-03 01:04:38 +07:00
|
|
|
* Returns 0 on success, errno otherwise
|
|
|
|
*/
|
2014-02-18 15:06:46 +07:00
|
|
|
static int tipc_shutdown(struct socket *sock, int how)
|
2006-01-03 01:04:38 +07:00
|
|
|
{
|
2008-04-15 14:22:02 +07:00
|
|
|
struct sock *sk = sock->sk;
|
2006-01-03 01:04:38 +07:00
|
|
|
int res;
|
|
|
|
|
2008-03-07 06:05:38 +07:00
|
|
|
if (how != SHUT_RDWR)
|
|
|
|
return -EINVAL;
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2008-04-15 14:22:02 +07:00
|
|
|
lock_sock(sk);
|
2006-01-03 01:04:38 +07:00
|
|
|
|
tipc: add trace_events for tipc socket
The commit adds the new trace_events for TIPC socket object:
trace_tipc_sk_create()
trace_tipc_sk_poll()
trace_tipc_sk_sendmsg()
trace_tipc_sk_sendmcast()
trace_tipc_sk_sendstream()
trace_tipc_sk_filter_rcv()
trace_tipc_sk_advance_rx()
trace_tipc_sk_rej_msg()
trace_tipc_sk_drop_msg()
trace_tipc_sk_release()
trace_tipc_sk_shutdown()
trace_tipc_sk_overlimit1()
trace_tipc_sk_overlimit2()
Also, enables the traces for the following cases:
- When user creates a TIPC socket;
- When user calls poll() on TIPC socket;
- When user sends a dgram/mcast/stream message.
- When a message is put into the socket 'sk_receive_queue';
- When a message is released from the socket 'sk_receive_queue';
- When a message is rejected (e.g. due to no port, invalid, etc.);
- When a message is dropped (e.g. due to wrong message type);
- When socket is released;
- When socket is shutdown;
- When socket rcvq's allocation is overlimit (> 90%);
- When socket rcvq + bklq's allocation is overlimit (> 90%);
- When the 'TIPC_ERR_OVERLOAD/2' issue happens;
Note:
a) All the socket traces are designed to be able to trace on a specific
socket by either using the 'event filtering' feature on a known socket
'portid' value or the sysctl file:
/proc/sys/net/tipc/sk_filter
The file determines a 'tuple' for what socket should be traced:
(portid, sock type, name type, name lower, name upper)
where:
+ 'portid' is the socket portid generated at socket creating, can be
found in the trace outputs or the 'tipc socket list' command printouts;
+ 'sock type' is the socket type (1 = SOCK_TREAM, ...);
+ 'name type', 'name lower' and 'name upper' are the service name being
connected to or published by the socket.
Value '0' means 'ANY', the default tuple value is (0, 0, 0, 0, 0) i.e.
the traces happen for every sockets with no filter.
b) The 'tipc_sk_overlimit1/2' event is also a conditional trace_event
which happens when the socket receive queue (and backlog queue) is
about to be overloaded, when the queue allocation is > 90%. Then, when
the trace is enabled, the last skbs leading to the TIPC_ERR_OVERLOAD/2
issue can be traced.
The trace event is designed as an 'upper watermark' notification that
the other traces (e.g. 'tipc_sk_advance_rx' vs 'tipc_sk_filter_rcv') or
actions can be triggerred in the meanwhile to see what is going on with
the socket queue.
In addition, the 'trace_tipc_sk_dump()' is also placed at the
'TIPC_ERR_OVERLOAD/2' case, so the socket and last skb can be dumped
for post-analysis.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 09:17:58 +07:00
|
|
|
trace_tipc_sk_shutdown(sk, NULL, TIPC_DUMP_ALL, " ");
|
2016-11-01 20:02:47 +07:00
|
|
|
__tipc_shutdown(sock, TIPC_CONN_SHUTDOWN);
|
tipc: fix shutdown() of connectionless socket
syzbot is reporting hung task at nbd_ioctl() [1], for there are two
problems regarding TIPC's connectionless socket's shutdown() operation.
----------
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <linux/nbd.h>
#include <unistd.h>
int main(int argc, char *argv[])
{
const int fd = open("/dev/nbd0", 3);
alarm(5);
ioctl(fd, NBD_SET_SOCK, socket(PF_TIPC, SOCK_DGRAM, 0));
ioctl(fd, NBD_DO_IT, 0); /* To be interrupted by SIGALRM. */
return 0;
}
----------
One problem is that wait_for_completion() from flush_workqueue() from
nbd_start_device_ioctl() from nbd_ioctl() cannot be completed when
nbd_start_device_ioctl() received a signal at wait_event_interruptible(),
for tipc_shutdown() from kernel_sock_shutdown(SHUT_RDWR) from
nbd_mark_nsock_dead() from sock_shutdown() from nbd_start_device_ioctl()
is failing to wake up a WQ thread sleeping at wait_woken() from
tipc_wait_for_rcvmsg() from sock_recvmsg() from sock_xmit() from
nbd_read_stat() from recv_work() scheduled by nbd_start_device() from
nbd_start_device_ioctl(). Fix this problem by always invoking
sk->sk_state_change() (like inet_shutdown() does) when tipc_shutdown() is
called.
The other problem is that tipc_wait_for_rcvmsg() cannot return when
tipc_shutdown() is called, for tipc_shutdown() sets sk->sk_shutdown to
SEND_SHUTDOWN (despite "how" is SHUT_RDWR) while tipc_wait_for_rcvmsg()
needs sk->sk_shutdown set to RCV_SHUTDOWN or SHUTDOWN_MASK. Fix this
problem by setting sk->sk_shutdown to SHUTDOWN_MASK (like inet_shutdown()
does) when the socket is connectionless.
[1] https://syzkaller.appspot.com/bug?id=3fe51d307c1f0a845485cf1798aa059d12bf18b2
Reported-by: syzbot <syzbot+e36f41d207137b5d12f7@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-02 20:44:16 +07:00
|
|
|
if (tipc_sk_type_connectionless(sk))
|
|
|
|
sk->sk_shutdown = SHUTDOWN_MASK;
|
|
|
|
else
|
|
|
|
sk->sk_shutdown = SEND_SHUTDOWN;
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2016-11-01 20:02:47 +07:00
|
|
|
if (sk->sk_state == TIPC_DISCONNECTING) {
|
2012-10-29 20:38:15 +07:00
|
|
|
/* Discard any unreceived messages */
|
2013-01-21 05:30:08 +07:00
|
|
|
__skb_queue_purge(&sk->sk_receive_queue);
|
2012-10-29 20:38:15 +07:00
|
|
|
|
2006-01-03 01:04:38 +07:00
|
|
|
res = 0;
|
2016-11-01 20:02:47 +07:00
|
|
|
} else {
|
2006-01-03 01:04:38 +07:00
|
|
|
res = -ENOTCONN;
|
|
|
|
}
|
tipc: fix shutdown() of connectionless socket
syzbot is reporting hung task at nbd_ioctl() [1], for there are two
problems regarding TIPC's connectionless socket's shutdown() operation.
----------
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <linux/nbd.h>
#include <unistd.h>
int main(int argc, char *argv[])
{
const int fd = open("/dev/nbd0", 3);
alarm(5);
ioctl(fd, NBD_SET_SOCK, socket(PF_TIPC, SOCK_DGRAM, 0));
ioctl(fd, NBD_DO_IT, 0); /* To be interrupted by SIGALRM. */
return 0;
}
----------
One problem is that wait_for_completion() from flush_workqueue() from
nbd_start_device_ioctl() from nbd_ioctl() cannot be completed when
nbd_start_device_ioctl() received a signal at wait_event_interruptible(),
for tipc_shutdown() from kernel_sock_shutdown(SHUT_RDWR) from
nbd_mark_nsock_dead() from sock_shutdown() from nbd_start_device_ioctl()
is failing to wake up a WQ thread sleeping at wait_woken() from
tipc_wait_for_rcvmsg() from sock_recvmsg() from sock_xmit() from
nbd_read_stat() from recv_work() scheduled by nbd_start_device() from
nbd_start_device_ioctl(). Fix this problem by always invoking
sk->sk_state_change() (like inet_shutdown() does) when tipc_shutdown() is
called.
The other problem is that tipc_wait_for_rcvmsg() cannot return when
tipc_shutdown() is called, for tipc_shutdown() sets sk->sk_shutdown to
SEND_SHUTDOWN (despite "how" is SHUT_RDWR) while tipc_wait_for_rcvmsg()
needs sk->sk_shutdown set to RCV_SHUTDOWN or SHUTDOWN_MASK. Fix this
problem by setting sk->sk_shutdown to SHUTDOWN_MASK (like inet_shutdown()
does) when the socket is connectionless.
[1] https://syzkaller.appspot.com/bug?id=3fe51d307c1f0a845485cf1798aa059d12bf18b2
Reported-by: syzbot <syzbot+e36f41d207137b5d12f7@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-02 20:44:16 +07:00
|
|
|
/* Wake up anyone sleeping in poll. */
|
|
|
|
sk->sk_state_change(sk);
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2008-04-15 14:22:02 +07:00
|
|
|
release_sock(sk);
|
2006-01-03 01:04:38 +07:00
|
|
|
return res;
|
|
|
|
}
|
|
|
|
|
2018-09-29 01:23:19 +07:00
|
|
|
static void tipc_sk_check_probing_state(struct sock *sk,
|
|
|
|
struct sk_buff_head *list)
|
|
|
|
{
|
|
|
|
struct tipc_sock *tsk = tipc_sk(sk);
|
|
|
|
u32 pnode = tsk_peer_node(tsk);
|
|
|
|
u32 pport = tsk_peer_port(tsk);
|
|
|
|
u32 self = tsk_own_node(tsk);
|
|
|
|
u32 oport = tsk->portid;
|
|
|
|
struct sk_buff *skb;
|
|
|
|
|
|
|
|
if (tsk->probe_unacked) {
|
|
|
|
tipc_set_sk_state(sk, TIPC_DISCONNECTING);
|
|
|
|
sk->sk_err = ECONNABORTED;
|
|
|
|
tipc_node_remove_conn(sock_net(sk), pnode, pport);
|
|
|
|
sk->sk_state_change(sk);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
/* Prepare new probe */
|
|
|
|
skb = tipc_msg_create(CONN_MANAGER, CONN_PROBE, INT_H_SIZE, 0,
|
|
|
|
pnode, self, pport, oport, TIPC_OK);
|
|
|
|
if (skb)
|
|
|
|
__skb_queue_tail(list, skb);
|
|
|
|
tsk->probe_unacked = true;
|
|
|
|
sk_reset_timer(sk, &sk->sk_timer, jiffies + CONN_PROBING_INTV);
|
|
|
|
}
|
|
|
|
|
2018-09-29 01:23:22 +07:00
|
|
|
static void tipc_sk_retry_connect(struct sock *sk, struct sk_buff_head *list)
|
|
|
|
{
|
|
|
|
struct tipc_sock *tsk = tipc_sk(sk);
|
|
|
|
|
|
|
|
/* Try again later if dest link is congested */
|
|
|
|
if (tsk->cong_link_cnt) {
|
|
|
|
sk_reset_timer(sk, &sk->sk_timer, msecs_to_jiffies(100));
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
/* Prepare SYN for retransmit */
|
|
|
|
tipc_msg_skb_clone(&sk->sk_write_queue, list);
|
|
|
|
}
|
|
|
|
|
2017-10-31 04:06:45 +07:00
|
|
|
static void tipc_sk_timeout(struct timer_list *t)
|
2014-08-23 05:09:09 +07:00
|
|
|
{
|
2017-10-31 04:06:45 +07:00
|
|
|
struct sock *sk = from_timer(sk, t, sk_timer);
|
|
|
|
struct tipc_sock *tsk = tipc_sk(sk);
|
2018-09-29 01:23:19 +07:00
|
|
|
u32 pnode = tsk_peer_node(tsk);
|
|
|
|
struct sk_buff_head list;
|
2018-09-29 01:23:22 +07:00
|
|
|
int rc = 0;
|
2014-08-23 05:09:09 +07:00
|
|
|
|
2019-08-15 21:42:50 +07:00
|
|
|
__skb_queue_head_init(&list);
|
2014-08-23 05:09:16 +07:00
|
|
|
bh_lock_sock(sk);
|
2017-10-20 16:21:32 +07:00
|
|
|
|
|
|
|
/* Try again later if socket is busy */
|
|
|
|
if (sock_owned_by_user(sk)) {
|
|
|
|
sk_reset_timer(sk, &sk->sk_timer, jiffies + HZ / 20);
|
2018-09-29 01:23:19 +07:00
|
|
|
bh_unlock_sock(sk);
|
2019-11-28 10:10:06 +07:00
|
|
|
sock_put(sk);
|
2018-09-29 01:23:19 +07:00
|
|
|
return;
|
2014-08-23 05:09:09 +07:00
|
|
|
}
|
|
|
|
|
2018-09-29 01:23:19 +07:00
|
|
|
if (sk->sk_state == TIPC_ESTABLISHED)
|
|
|
|
tipc_sk_check_probing_state(sk, &list);
|
2018-09-29 01:23:22 +07:00
|
|
|
else if (sk->sk_state == TIPC_CONNECTING)
|
|
|
|
tipc_sk_retry_connect(sk, &list);
|
2018-09-29 01:23:19 +07:00
|
|
|
|
2014-08-23 05:09:09 +07:00
|
|
|
bh_unlock_sock(sk);
|
2018-09-29 01:23:19 +07:00
|
|
|
|
|
|
|
if (!skb_queue_empty(&list))
|
2018-09-29 01:23:22 +07:00
|
|
|
rc = tipc_node_xmit(sock_net(sk), &list, pnode, tsk->portid);
|
2018-09-29 01:23:19 +07:00
|
|
|
|
2018-09-29 01:23:22 +07:00
|
|
|
/* SYN messages may cause link congestion */
|
|
|
|
if (rc == -ELINKCONG) {
|
|
|
|
tipc_dest_push(&tsk->cong_links, pnode, 0);
|
|
|
|
tsk->cong_link_cnt = 1;
|
|
|
|
}
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
sock_put(sk);
|
2014-08-23 05:09:09 +07:00
|
|
|
}
|
|
|
|
|
2014-08-23 05:09:20 +07:00
|
|
|
static int tipc_sk_publish(struct tipc_sock *tsk, uint scope,
|
2014-08-23 05:09:17 +07:00
|
|
|
struct tipc_name_seq const *seq)
|
|
|
|
{
|
2016-11-01 20:02:40 +07:00
|
|
|
struct sock *sk = &tsk->sk;
|
|
|
|
struct net *net = sock_net(sk);
|
2014-08-23 05:09:17 +07:00
|
|
|
struct publication *publ;
|
|
|
|
u32 key;
|
|
|
|
|
2018-03-15 22:48:51 +07:00
|
|
|
if (scope != TIPC_NODE_SCOPE)
|
|
|
|
scope = TIPC_CLUSTER_SCOPE;
|
|
|
|
|
2016-11-01 20:02:40 +07:00
|
|
|
if (tipc_sk_connected(sk))
|
2014-08-23 05:09:17 +07:00
|
|
|
return -EINVAL;
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
key = tsk->portid + tsk->pub_count + 1;
|
|
|
|
if (key == tsk->portid)
|
2014-08-23 05:09:17 +07:00
|
|
|
return -EADDRINUSE;
|
|
|
|
|
2015-01-09 14:27:05 +07:00
|
|
|
publ = tipc_nametbl_publish(net, seq->type, seq->lower, seq->upper,
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
scope, tsk->portid, key);
|
2014-08-23 05:09:17 +07:00
|
|
|
if (unlikely(!publ))
|
|
|
|
return -EINVAL;
|
|
|
|
|
2018-03-15 22:48:55 +07:00
|
|
|
list_add(&publ->binding_sock, &tsk->publications);
|
2014-08-23 05:09:20 +07:00
|
|
|
tsk->pub_count++;
|
|
|
|
tsk->published = 1;
|
2014-08-23 05:09:17 +07:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2014-08-23 05:09:20 +07:00
|
|
|
static int tipc_sk_withdraw(struct tipc_sock *tsk, uint scope,
|
2014-08-23 05:09:17 +07:00
|
|
|
struct tipc_name_seq const *seq)
|
|
|
|
{
|
2015-01-09 14:27:05 +07:00
|
|
|
struct net *net = sock_net(&tsk->sk);
|
2014-08-23 05:09:17 +07:00
|
|
|
struct publication *publ;
|
|
|
|
struct publication *safe;
|
|
|
|
int rc = -EINVAL;
|
|
|
|
|
2018-03-15 22:48:51 +07:00
|
|
|
if (scope != TIPC_NODE_SCOPE)
|
|
|
|
scope = TIPC_CLUSTER_SCOPE;
|
|
|
|
|
2018-03-15 22:48:55 +07:00
|
|
|
list_for_each_entry_safe(publ, safe, &tsk->publications, binding_sock) {
|
2014-08-23 05:09:17 +07:00
|
|
|
if (seq) {
|
|
|
|
if (publ->scope != scope)
|
|
|
|
continue;
|
|
|
|
if (publ->type != seq->type)
|
|
|
|
continue;
|
|
|
|
if (publ->lower != seq->lower)
|
|
|
|
continue;
|
|
|
|
if (publ->upper != seq->upper)
|
|
|
|
break;
|
2015-01-09 14:27:05 +07:00
|
|
|
tipc_nametbl_withdraw(net, publ->type, publ->lower,
|
2018-03-30 04:20:43 +07:00
|
|
|
publ->upper, publ->key);
|
2014-08-23 05:09:17 +07:00
|
|
|
rc = 0;
|
|
|
|
break;
|
|
|
|
}
|
2015-01-09 14:27:05 +07:00
|
|
|
tipc_nametbl_withdraw(net, publ->type, publ->lower,
|
2018-03-30 04:20:43 +07:00
|
|
|
publ->upper, publ->key);
|
2014-08-23 05:09:17 +07:00
|
|
|
rc = 0;
|
|
|
|
}
|
2014-08-23 05:09:20 +07:00
|
|
|
if (list_empty(&tsk->publications))
|
|
|
|
tsk->published = 0;
|
2014-08-23 05:09:17 +07:00
|
|
|
return rc;
|
|
|
|
}
|
|
|
|
|
2014-08-23 05:09:14 +07:00
|
|
|
/* tipc_sk_reinit: set non-zero address in all existing sockets
|
|
|
|
* when we go from standalone to network mode.
|
|
|
|
*/
|
2015-01-09 14:27:08 +07:00
|
|
|
void tipc_sk_reinit(struct net *net)
|
2014-08-23 05:09:14 +07:00
|
|
|
{
|
2015-01-09 14:27:08 +07:00
|
|
|
struct tipc_net *tn = net_generic(net, tipc_net_id);
|
2017-02-11 18:26:46 +07:00
|
|
|
struct rhashtable_iter iter;
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
struct tipc_sock *tsk;
|
2014-08-23 05:09:14 +07:00
|
|
|
struct tipc_msg *msg;
|
|
|
|
|
2017-02-11 18:26:46 +07:00
|
|
|
rhashtable_walk_enter(&tn->sk_rht, &iter);
|
|
|
|
|
|
|
|
do {
|
2017-12-05 01:31:41 +07:00
|
|
|
rhashtable_walk_start(&iter);
|
2017-02-11 18:26:46 +07:00
|
|
|
|
|
|
|
while ((tsk = rhashtable_walk_next(&iter)) && !IS_ERR(tsk)) {
|
2018-12-11 02:49:55 +07:00
|
|
|
sock_hold(&tsk->sk);
|
|
|
|
rhashtable_walk_stop(&iter);
|
|
|
|
lock_sock(&tsk->sk);
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
msg = &tsk->phdr;
|
2018-03-23 02:42:49 +07:00
|
|
|
msg_set_prevnode(msg, tipc_own_addr(net));
|
|
|
|
msg_set_orignode(msg, tipc_own_addr(net));
|
2018-12-11 02:49:55 +07:00
|
|
|
release_sock(&tsk->sk);
|
|
|
|
rhashtable_walk_start(&iter);
|
|
|
|
sock_put(&tsk->sk);
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
}
|
2017-12-05 01:31:41 +07:00
|
|
|
|
2017-02-11 18:26:46 +07:00
|
|
|
rhashtable_walk_stop(&iter);
|
|
|
|
} while (tsk == ERR_PTR(-EAGAIN));
|
2018-08-24 06:19:44 +07:00
|
|
|
|
|
|
|
rhashtable_walk_exit(&iter);
|
2014-08-23 05:09:14 +07:00
|
|
|
}
|
|
|
|
|
2015-01-09 14:27:08 +07:00
|
|
|
static struct tipc_sock *tipc_sk_lookup(struct net *net, u32 portid)
|
2014-08-23 05:09:19 +07:00
|
|
|
{
|
2015-01-09 14:27:08 +07:00
|
|
|
struct tipc_net *tn = net_generic(net, tipc_net_id);
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
struct tipc_sock *tsk;
|
2014-08-23 05:09:19 +07:00
|
|
|
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
rcu_read_lock();
|
2019-11-22 15:15:19 +07:00
|
|
|
tsk = rhashtable_lookup(&tn->sk_rht, &portid, tsk_rht_params);
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
if (tsk)
|
|
|
|
sock_hold(&tsk->sk);
|
|
|
|
rcu_read_unlock();
|
2014-08-23 05:09:19 +07:00
|
|
|
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
return tsk;
|
2014-08-23 05:09:19 +07:00
|
|
|
}
|
|
|
|
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
static int tipc_sk_insert(struct tipc_sock *tsk)
|
2014-08-23 05:09:19 +07:00
|
|
|
{
|
2015-01-09 14:27:08 +07:00
|
|
|
struct sock *sk = &tsk->sk;
|
|
|
|
struct net *net = sock_net(sk);
|
|
|
|
struct tipc_net *tn = net_generic(net, tipc_net_id);
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
u32 remaining = (TIPC_MAX_PORT - TIPC_MIN_PORT) + 1;
|
|
|
|
u32 portid = prandom_u32() % remaining + TIPC_MIN_PORT;
|
2014-08-23 05:09:19 +07:00
|
|
|
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
while (remaining--) {
|
|
|
|
portid++;
|
|
|
|
if ((portid < TIPC_MIN_PORT) || (portid > TIPC_MAX_PORT))
|
|
|
|
portid = TIPC_MIN_PORT;
|
|
|
|
tsk->portid = portid;
|
|
|
|
sock_hold(&tsk->sk);
|
2015-03-20 17:57:05 +07:00
|
|
|
if (!rhashtable_lookup_insert_fast(&tn->sk_rht, &tsk->node,
|
|
|
|
tsk_rht_params))
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
return 0;
|
|
|
|
sock_put(&tsk->sk);
|
2014-08-23 05:09:19 +07:00
|
|
|
}
|
|
|
|
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
return -1;
|
2014-08-23 05:09:19 +07:00
|
|
|
}
|
|
|
|
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
static void tipc_sk_remove(struct tipc_sock *tsk)
|
2014-08-23 05:09:19 +07:00
|
|
|
{
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
struct sock *sk = &tsk->sk;
|
2015-01-09 14:27:08 +07:00
|
|
|
struct tipc_net *tn = net_generic(sock_net(sk), tipc_net_id);
|
2014-08-23 05:09:19 +07:00
|
|
|
|
2015-03-20 17:57:05 +07:00
|
|
|
if (!rhashtable_remove_fast(&tn->sk_rht, &tsk->node, tsk_rht_params)) {
|
2017-06-30 17:08:01 +07:00
|
|
|
WARN_ON(refcount_read(&sk->sk_refcnt) == 1);
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
__sock_put(sk);
|
2014-08-23 05:09:19 +07:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2015-03-20 17:57:05 +07:00
|
|
|
static const struct rhashtable_params tsk_rht_params = {
|
|
|
|
.nelem_hint = 192,
|
|
|
|
.head_offset = offsetof(struct tipc_sock, node),
|
|
|
|
.key_offset = offsetof(struct tipc_sock, portid),
|
|
|
|
.key_len = sizeof(u32), /* portid */
|
|
|
|
.max_size = 1048576,
|
|
|
|
.min_size = 256,
|
2015-03-25 03:42:19 +07:00
|
|
|
.automatic_shrinking = true,
|
2015-03-20 17:57:05 +07:00
|
|
|
};
|
|
|
|
|
2015-01-09 14:27:08 +07:00
|
|
|
int tipc_sk_rht_init(struct net *net)
|
2014-08-23 05:09:19 +07:00
|
|
|
{
|
2015-01-09 14:27:08 +07:00
|
|
|
struct tipc_net *tn = net_generic(net, tipc_net_id);
|
2015-03-20 17:57:05 +07:00
|
|
|
|
|
|
|
return rhashtable_init(&tn->sk_rht, &tsk_rht_params);
|
2014-08-23 05:09:19 +07:00
|
|
|
}
|
|
|
|
|
2015-01-09 14:27:08 +07:00
|
|
|
void tipc_sk_rht_destroy(struct net *net)
|
2014-08-23 05:09:19 +07:00
|
|
|
{
|
2015-01-09 14:27:08 +07:00
|
|
|
struct tipc_net *tn = net_generic(net, tipc_net_id);
|
|
|
|
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
/* Wait for socket readers to complete */
|
|
|
|
synchronize_net();
|
2014-08-23 05:09:19 +07:00
|
|
|
|
2015-01-09 14:27:08 +07:00
|
|
|
rhashtable_destroy(&tn->sk_rht);
|
2014-08-23 05:09:19 +07:00
|
|
|
}
|
|
|
|
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
static int tipc_sk_join(struct tipc_sock *tsk, struct tipc_group_req *mreq)
|
|
|
|
{
|
|
|
|
struct net *net = sock_net(&tsk->sk);
|
|
|
|
struct tipc_group *grp = tsk->group;
|
|
|
|
struct tipc_msg *hdr = &tsk->phdr;
|
|
|
|
struct tipc_name_seq seq;
|
|
|
|
int rc;
|
|
|
|
|
|
|
|
if (mreq->type < TIPC_RESERVED_TYPES)
|
|
|
|
return -EACCES;
|
2018-01-09 03:03:30 +07:00
|
|
|
if (mreq->scope > TIPC_NODE_SCOPE)
|
|
|
|
return -EINVAL;
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
if (grp)
|
|
|
|
return -EACCES;
|
2018-01-17 22:42:46 +07:00
|
|
|
grp = tipc_group_create(net, tsk->portid, mreq, &tsk->group_is_open);
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
if (!grp)
|
|
|
|
return -ENOMEM;
|
|
|
|
tsk->group = grp;
|
|
|
|
msg_set_lookup_scope(hdr, mreq->scope);
|
|
|
|
msg_set_nametype(hdr, mreq->type);
|
|
|
|
msg_set_dest_droppable(hdr, true);
|
|
|
|
seq.type = mreq->type;
|
|
|
|
seq.lower = mreq->instance;
|
|
|
|
seq.upper = seq.lower;
|
2018-01-09 03:03:30 +07:00
|
|
|
tipc_nametbl_build_group(net, grp, mreq->type, mreq->scope);
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
rc = tipc_sk_publish(tsk, mreq->scope, &seq);
|
2017-10-25 05:44:49 +07:00
|
|
|
if (rc) {
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
tipc_group_delete(net, grp);
|
2017-10-25 05:44:49 +07:00
|
|
|
tsk->group = NULL;
|
2018-01-11 03:08:50 +07:00
|
|
|
return rc;
|
2017-10-25 05:44:49 +07:00
|
|
|
}
|
2018-01-09 03:03:28 +07:00
|
|
|
/* Eliminate any risk that a broadcast overtakes sent JOINs */
|
2017-10-13 16:04:32 +07:00
|
|
|
tsk->mc_method.rcast = true;
|
|
|
|
tsk->mc_method.mandatory = true;
|
2018-01-09 03:03:28 +07:00
|
|
|
tipc_group_join(net, grp, &tsk->sk.sk_rcvbuf);
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
return rc;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int tipc_sk_leave(struct tipc_sock *tsk)
|
|
|
|
{
|
|
|
|
struct net *net = sock_net(&tsk->sk);
|
|
|
|
struct tipc_group *grp = tsk->group;
|
|
|
|
struct tipc_name_seq seq;
|
|
|
|
int scope;
|
|
|
|
|
|
|
|
if (!grp)
|
|
|
|
return -EINVAL;
|
|
|
|
tipc_group_self(grp, &seq, &scope);
|
|
|
|
tipc_group_delete(net, grp);
|
|
|
|
tsk->group = NULL;
|
|
|
|
tipc_sk_withdraw(tsk, scope, &seq);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2006-01-03 01:04:38 +07:00
|
|
|
/**
|
2014-02-18 15:06:46 +07:00
|
|
|
* tipc_setsockopt - set socket option
|
2006-01-03 01:04:38 +07:00
|
|
|
* @sock: socket structure
|
|
|
|
* @lvl: option level
|
|
|
|
* @opt: option identifier
|
|
|
|
* @ov: pointer to new option value
|
|
|
|
* @ol: length of option value
|
2007-02-09 21:25:21 +07:00
|
|
|
*
|
|
|
|
* For stream sockets only, accepts and ignores all IPPROTO_TCP options
|
2006-01-03 01:04:38 +07:00
|
|
|
* (to ease compatibility).
|
2007-02-09 21:25:21 +07:00
|
|
|
*
|
2006-01-03 01:04:38 +07:00
|
|
|
* Returns 0 on success, errno otherwise
|
|
|
|
*/
|
2014-02-18 15:06:46 +07:00
|
|
|
static int tipc_setsockopt(struct socket *sock, int lvl, int opt,
|
2020-07-23 13:09:07 +07:00
|
|
|
sockptr_t ov, unsigned int ol)
|
2006-01-03 01:04:38 +07:00
|
|
|
{
|
2008-04-15 14:22:02 +07:00
|
|
|
struct sock *sk = sock->sk;
|
2014-03-12 22:31:12 +07:00
|
|
|
struct tipc_sock *tsk = tipc_sk(sk);
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
struct tipc_group_req mreq;
|
2017-01-19 01:50:53 +07:00
|
|
|
u32 value = 0;
|
2017-01-24 16:49:35 +07:00
|
|
|
int res = 0;
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2007-02-09 21:25:21 +07:00
|
|
|
if ((lvl == IPPROTO_TCP) && (sock->type == SOCK_STREAM))
|
|
|
|
return 0;
|
2006-01-03 01:04:38 +07:00
|
|
|
if (lvl != SOL_TIPC)
|
|
|
|
return -ENOPROTOOPT;
|
2017-01-19 01:50:53 +07:00
|
|
|
|
|
|
|
switch (opt) {
|
|
|
|
case TIPC_IMPORTANCE:
|
|
|
|
case TIPC_SRC_DROPPABLE:
|
|
|
|
case TIPC_DEST_DROPPABLE:
|
|
|
|
case TIPC_CONN_TIMEOUT:
|
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 20:00:41 +07:00
|
|
|
case TIPC_NODELAY:
|
2017-01-19 01:50:53 +07:00
|
|
|
if (ol < sizeof(value))
|
|
|
|
return -EINVAL;
|
2020-07-23 13:09:07 +07:00
|
|
|
if (copy_from_sockptr(&value, ov, sizeof(u32)))
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
return -EFAULT;
|
|
|
|
break;
|
|
|
|
case TIPC_GROUP_JOIN:
|
|
|
|
if (ol < sizeof(mreq))
|
|
|
|
return -EINVAL;
|
2020-07-23 13:09:07 +07:00
|
|
|
if (copy_from_sockptr(&mreq, ov, sizeof(mreq)))
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
return -EFAULT;
|
2017-01-19 01:50:53 +07:00
|
|
|
break;
|
|
|
|
default:
|
2020-07-23 13:09:07 +07:00
|
|
|
if (!sockptr_is_null(ov) || ol)
|
2017-01-19 01:50:53 +07:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2008-04-15 14:22:02 +07:00
|
|
|
lock_sock(sk);
|
2007-02-09 21:25:21 +07:00
|
|
|
|
2006-01-03 01:04:38 +07:00
|
|
|
switch (opt) {
|
|
|
|
case TIPC_IMPORTANCE:
|
2020-05-28 12:12:36 +07:00
|
|
|
res = tsk_set_importance(sk, value);
|
2006-01-03 01:04:38 +07:00
|
|
|
break;
|
|
|
|
case TIPC_SRC_DROPPABLE:
|
|
|
|
if (sock->type != SOCK_STREAM)
|
2014-08-23 05:09:20 +07:00
|
|
|
tsk_set_unreliable(tsk, value);
|
2007-02-09 21:25:21 +07:00
|
|
|
else
|
2006-01-03 01:04:38 +07:00
|
|
|
res = -ENOPROTOOPT;
|
|
|
|
break;
|
|
|
|
case TIPC_DEST_DROPPABLE:
|
2014-08-23 05:09:20 +07:00
|
|
|
tsk_set_unreturnable(tsk, value);
|
2006-01-03 01:04:38 +07:00
|
|
|
break;
|
|
|
|
case TIPC_CONN_TIMEOUT:
|
2011-05-27 00:44:34 +07:00
|
|
|
tipc_sk(sk)->conn_timeout = value;
|
2006-01-03 01:04:38 +07:00
|
|
|
break;
|
2017-01-19 01:50:53 +07:00
|
|
|
case TIPC_MCAST_BROADCAST:
|
|
|
|
tsk->mc_method.rcast = false;
|
|
|
|
tsk->mc_method.mandatory = true;
|
|
|
|
break;
|
|
|
|
case TIPC_MCAST_REPLICAST:
|
|
|
|
tsk->mc_method.rcast = true;
|
|
|
|
tsk->mc_method.mandatory = true;
|
|
|
|
break;
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
case TIPC_GROUP_JOIN:
|
|
|
|
res = tipc_sk_join(tsk, &mreq);
|
|
|
|
break;
|
|
|
|
case TIPC_GROUP_LEAVE:
|
|
|
|
res = tipc_sk_leave(tsk);
|
|
|
|
break;
|
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 20:00:41 +07:00
|
|
|
case TIPC_NODELAY:
|
|
|
|
tsk->nodelay = !!value;
|
|
|
|
tsk_set_nagle(tsk);
|
|
|
|
break;
|
2006-01-03 01:04:38 +07:00
|
|
|
default:
|
|
|
|
res = -EINVAL;
|
|
|
|
}
|
|
|
|
|
2008-04-15 14:22:02 +07:00
|
|
|
release_sock(sk);
|
|
|
|
|
2006-01-03 01:04:38 +07:00
|
|
|
return res;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
2014-02-18 15:06:46 +07:00
|
|
|
* tipc_getsockopt - get socket option
|
2006-01-03 01:04:38 +07:00
|
|
|
* @sock: socket structure
|
|
|
|
* @lvl: option level
|
|
|
|
* @opt: option identifier
|
|
|
|
* @ov: receptacle for option value
|
|
|
|
* @ol: receptacle for length of option value
|
2007-02-09 21:25:21 +07:00
|
|
|
*
|
|
|
|
* For stream sockets only, returns 0 length result for all IPPROTO_TCP options
|
2006-01-03 01:04:38 +07:00
|
|
|
* (to ease compatibility).
|
2007-02-09 21:25:21 +07:00
|
|
|
*
|
2006-01-03 01:04:38 +07:00
|
|
|
* Returns 0 on success, errno otherwise
|
|
|
|
*/
|
2014-02-18 15:06:46 +07:00
|
|
|
static int tipc_getsockopt(struct socket *sock, int lvl, int opt,
|
|
|
|
char __user *ov, int __user *ol)
|
2006-01-03 01:04:38 +07:00
|
|
|
{
|
2008-04-15 14:22:02 +07:00
|
|
|
struct sock *sk = sock->sk;
|
2014-03-12 22:31:12 +07:00
|
|
|
struct tipc_sock *tsk = tipc_sk(sk);
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
struct tipc_name_seq seq;
|
|
|
|
int len, scope;
|
2006-01-03 01:04:38 +07:00
|
|
|
u32 value;
|
2007-02-09 21:25:21 +07:00
|
|
|
int res;
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2007-02-09 21:25:21 +07:00
|
|
|
if ((lvl == IPPROTO_TCP) && (sock->type == SOCK_STREAM))
|
|
|
|
return put_user(0, ol);
|
2006-01-03 01:04:38 +07:00
|
|
|
if (lvl != SOL_TIPC)
|
|
|
|
return -ENOPROTOOPT;
|
2011-01-01 01:59:33 +07:00
|
|
|
res = get_user(len, ol);
|
|
|
|
if (res)
|
2007-02-09 21:25:21 +07:00
|
|
|
return res;
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2008-04-15 14:22:02 +07:00
|
|
|
lock_sock(sk);
|
2006-01-03 01:04:38 +07:00
|
|
|
|
|
|
|
switch (opt) {
|
|
|
|
case TIPC_IMPORTANCE:
|
2014-08-23 05:09:20 +07:00
|
|
|
value = tsk_importance(tsk);
|
2006-01-03 01:04:38 +07:00
|
|
|
break;
|
|
|
|
case TIPC_SRC_DROPPABLE:
|
2014-08-23 05:09:20 +07:00
|
|
|
value = tsk_unreliable(tsk);
|
2006-01-03 01:04:38 +07:00
|
|
|
break;
|
|
|
|
case TIPC_DEST_DROPPABLE:
|
2014-08-23 05:09:20 +07:00
|
|
|
value = tsk_unreturnable(tsk);
|
2006-01-03 01:04:38 +07:00
|
|
|
break;
|
|
|
|
case TIPC_CONN_TIMEOUT:
|
2014-08-23 05:09:20 +07:00
|
|
|
value = tsk->conn_timeout;
|
2008-04-15 14:22:02 +07:00
|
|
|
/* no need to set "res", since already 0 at this point */
|
2006-01-03 01:04:38 +07:00
|
|
|
break;
|
2011-01-01 01:59:32 +07:00
|
|
|
case TIPC_NODE_RECVQ_DEPTH:
|
tipc: eliminate aggregate sk_receive_queue limit
As a complement to the per-socket sk_recv_queue limit, TIPC keeps a
global atomic counter for the sum of sk_recv_queue sizes across all
tipc sockets. When incremented, the counter is compared to an upper
threshold value, and if this is reached, the message is rejected
with error code TIPC_OVERLOAD.
This check was originally meant to protect the node against
buffer exhaustion and general CPU overload. However, all experience
indicates that the feature not only is redundant on Linux, but even
harmful. Users run into the limit very often, causing disturbances
for their applications, while removing it seems to have no negative
effects at all. We have also seen that overall performance is
boosted significantly when this bottleneck is removed.
Furthermore, we don't see any other network protocols maintaining
such a mechanism, something strengthening our conviction that this
control can be eliminated.
As a result, the atomic variable tipc_queue_size is now unused
and so it can be deleted. There is a getsockopt call that used
to allow reading it; we retain that but just return zero for
maximum compatibility.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
[PG: phase out tipc_queue_size as pointed out by Neil Horman]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2012-11-27 18:15:27 +07:00
|
|
|
value = 0; /* was tipc_queue_size, now obsolete */
|
2009-06-30 10:25:39 +07:00
|
|
|
break;
|
2011-01-01 01:59:32 +07:00
|
|
|
case TIPC_SOCK_RECVQ_DEPTH:
|
2009-06-30 10:25:39 +07:00
|
|
|
value = skb_queue_len(&sk->sk_receive_queue);
|
|
|
|
break;
|
2019-04-18 21:02:19 +07:00
|
|
|
case TIPC_SOCK_RECVQ_USED:
|
|
|
|
value = sk_rmem_alloc_get(sk);
|
|
|
|
break;
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 16:04:23 +07:00
|
|
|
case TIPC_GROUP_JOIN:
|
|
|
|
seq.type = 0;
|
|
|
|
if (tsk->group)
|
|
|
|
tipc_group_self(tsk->group, &seq, &scope);
|
|
|
|
value = seq.type;
|
|
|
|
break;
|
2006-01-03 01:04:38 +07:00
|
|
|
default:
|
|
|
|
res = -EINVAL;
|
|
|
|
}
|
|
|
|
|
2008-04-15 14:22:02 +07:00
|
|
|
release_sock(sk);
|
|
|
|
|
2011-01-01 01:59:31 +07:00
|
|
|
if (res)
|
|
|
|
return res; /* "get" failed */
|
2006-01-03 01:04:38 +07:00
|
|
|
|
2011-01-01 01:59:31 +07:00
|
|
|
if (len < sizeof(value))
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
if (copy_to_user(ov, &value, sizeof(value)))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
return put_user(sizeof(value), ol);
|
2006-01-03 01:04:38 +07:00
|
|
|
}
|
|
|
|
|
2015-01-09 14:27:05 +07:00
|
|
|
static int tipc_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
|
2014-04-24 21:26:47 +07:00
|
|
|
{
|
2018-04-26 00:29:36 +07:00
|
|
|
struct net *net = sock_net(sock->sk);
|
|
|
|
struct tipc_sioc_nodeid_req nr = {0};
|
2014-04-24 21:26:47 +07:00
|
|
|
struct tipc_sioc_ln_req lnr;
|
|
|
|
void __user *argp = (void __user *)arg;
|
|
|
|
|
|
|
|
switch (cmd) {
|
|
|
|
case SIOCGETLINKNAME:
|
|
|
|
if (copy_from_user(&lnr, argp, sizeof(lnr)))
|
|
|
|
return -EFAULT;
|
2018-04-26 00:29:36 +07:00
|
|
|
if (!tipc_node_get_linkname(net,
|
2015-01-09 14:27:05 +07:00
|
|
|
lnr.bearer_id & 0xffff, lnr.peer,
|
2014-04-24 21:26:47 +07:00
|
|
|
lnr.linkname, TIPC_MAX_LINK_NAME)) {
|
|
|
|
if (copy_to_user(argp, &lnr, sizeof(lnr)))
|
|
|
|
return -EFAULT;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
return -EADDRNOTAVAIL;
|
2018-04-26 00:29:36 +07:00
|
|
|
case SIOCGETNODEID:
|
|
|
|
if (copy_from_user(&nr, argp, sizeof(nr)))
|
|
|
|
return -EFAULT;
|
|
|
|
if (!tipc_node_get_id(net, nr.peer, nr.node_id))
|
|
|
|
return -EADDRNOTAVAIL;
|
|
|
|
if (copy_to_user(argp, &nr, sizeof(nr)))
|
|
|
|
return -EFAULT;
|
|
|
|
return 0;
|
2014-04-24 21:26:47 +07:00
|
|
|
default:
|
|
|
|
return -ENOIOCTLCMD;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2017-03-29 16:22:16 +07:00
|
|
|
static int tipc_socketpair(struct socket *sock1, struct socket *sock2)
|
|
|
|
{
|
|
|
|
struct tipc_sock *tsk2 = tipc_sk(sock2->sk);
|
|
|
|
struct tipc_sock *tsk1 = tipc_sk(sock1->sk);
|
2017-03-29 16:22:17 +07:00
|
|
|
u32 onode = tipc_own_addr(sock_net(sock1->sk));
|
|
|
|
|
|
|
|
tsk1->peer.family = AF_TIPC;
|
|
|
|
tsk1->peer.addrtype = TIPC_ADDR_ID;
|
|
|
|
tsk1->peer.scope = TIPC_NODE_SCOPE;
|
|
|
|
tsk1->peer.addr.id.ref = tsk2->portid;
|
|
|
|
tsk1->peer.addr.id.node = onode;
|
|
|
|
tsk2->peer.family = AF_TIPC;
|
|
|
|
tsk2->peer.addrtype = TIPC_ADDR_ID;
|
|
|
|
tsk2->peer.scope = TIPC_NODE_SCOPE;
|
|
|
|
tsk2->peer.addr.id.ref = tsk1->portid;
|
|
|
|
tsk2->peer.addr.id.node = onode;
|
|
|
|
|
|
|
|
tipc_sk_finish_conn(tsk1, tsk2->portid, onode);
|
|
|
|
tipc_sk_finish_conn(tsk2, tsk1->portid, onode);
|
2017-03-29 16:22:16 +07:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2012-07-10 17:55:35 +07:00
|
|
|
/* Protocol switches for the various types of TIPC sockets */
|
|
|
|
|
2008-02-08 09:18:01 +07:00
|
|
|
static const struct proto_ops msg_ops = {
|
2011-01-01 01:59:32 +07:00
|
|
|
.owner = THIS_MODULE,
|
2006-01-03 01:04:38 +07:00
|
|
|
.family = AF_TIPC,
|
2014-02-18 15:06:46 +07:00
|
|
|
.release = tipc_release,
|
|
|
|
.bind = tipc_bind,
|
|
|
|
.connect = tipc_connect,
|
2017-03-29 16:22:17 +07:00
|
|
|
.socketpair = tipc_socketpair,
|
2011-07-06 17:01:13 +07:00
|
|
|
.accept = sock_no_accept,
|
2014-02-18 15:06:46 +07:00
|
|
|
.getname = tipc_getname,
|
2018-06-28 23:43:44 +07:00
|
|
|
.poll = tipc_poll,
|
2014-04-24 21:26:47 +07:00
|
|
|
.ioctl = tipc_ioctl,
|
2011-07-06 17:01:13 +07:00
|
|
|
.listen = sock_no_listen,
|
2014-02-18 15:06:46 +07:00
|
|
|
.shutdown = tipc_shutdown,
|
|
|
|
.setsockopt = tipc_setsockopt,
|
|
|
|
.getsockopt = tipc_getsockopt,
|
|
|
|
.sendmsg = tipc_sendmsg,
|
|
|
|
.recvmsg = tipc_recvmsg,
|
2007-07-19 08:44:56 +07:00
|
|
|
.mmap = sock_no_mmap,
|
|
|
|
.sendpage = sock_no_sendpage
|
2006-01-03 01:04:38 +07:00
|
|
|
};
|
|
|
|
|
2008-02-08 09:18:01 +07:00
|
|
|
static const struct proto_ops packet_ops = {
|
2011-01-01 01:59:32 +07:00
|
|
|
.owner = THIS_MODULE,
|
2006-01-03 01:04:38 +07:00
|
|
|
.family = AF_TIPC,
|
2014-02-18 15:06:46 +07:00
|
|
|
.release = tipc_release,
|
|
|
|
.bind = tipc_bind,
|
|
|
|
.connect = tipc_connect,
|
2017-03-29 16:22:16 +07:00
|
|
|
.socketpair = tipc_socketpair,
|
2014-02-18 15:06:46 +07:00
|
|
|
.accept = tipc_accept,
|
|
|
|
.getname = tipc_getname,
|
2018-06-28 23:43:44 +07:00
|
|
|
.poll = tipc_poll,
|
2014-04-24 21:26:47 +07:00
|
|
|
.ioctl = tipc_ioctl,
|
2014-02-18 15:06:46 +07:00
|
|
|
.listen = tipc_listen,
|
|
|
|
.shutdown = tipc_shutdown,
|
|
|
|
.setsockopt = tipc_setsockopt,
|
|
|
|
.getsockopt = tipc_getsockopt,
|
|
|
|
.sendmsg = tipc_send_packet,
|
|
|
|
.recvmsg = tipc_recvmsg,
|
2007-07-19 08:44:56 +07:00
|
|
|
.mmap = sock_no_mmap,
|
|
|
|
.sendpage = sock_no_sendpage
|
2006-01-03 01:04:38 +07:00
|
|
|
};
|
|
|
|
|
2008-02-08 09:18:01 +07:00
|
|
|
static const struct proto_ops stream_ops = {
|
2011-01-01 01:59:32 +07:00
|
|
|
.owner = THIS_MODULE,
|
2006-01-03 01:04:38 +07:00
|
|
|
.family = AF_TIPC,
|
2014-02-18 15:06:46 +07:00
|
|
|
.release = tipc_release,
|
|
|
|
.bind = tipc_bind,
|
|
|
|
.connect = tipc_connect,
|
2017-03-29 16:22:16 +07:00
|
|
|
.socketpair = tipc_socketpair,
|
2014-02-18 15:06:46 +07:00
|
|
|
.accept = tipc_accept,
|
|
|
|
.getname = tipc_getname,
|
2018-06-28 23:43:44 +07:00
|
|
|
.poll = tipc_poll,
|
2014-04-24 21:26:47 +07:00
|
|
|
.ioctl = tipc_ioctl,
|
2014-02-18 15:06:46 +07:00
|
|
|
.listen = tipc_listen,
|
|
|
|
.shutdown = tipc_shutdown,
|
|
|
|
.setsockopt = tipc_setsockopt,
|
|
|
|
.getsockopt = tipc_getsockopt,
|
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 22:55:11 +07:00
|
|
|
.sendmsg = tipc_sendstream,
|
2017-05-02 23:16:54 +07:00
|
|
|
.recvmsg = tipc_recvstream,
|
2007-07-19 08:44:56 +07:00
|
|
|
.mmap = sock_no_mmap,
|
|
|
|
.sendpage = sock_no_sendpage
|
2006-01-03 01:04:38 +07:00
|
|
|
};
|
|
|
|
|
2008-02-08 09:18:01 +07:00
|
|
|
static const struct net_proto_family tipc_family_ops = {
|
2011-01-01 01:59:32 +07:00
|
|
|
.owner = THIS_MODULE,
|
2006-01-03 01:04:38 +07:00
|
|
|
.family = AF_TIPC,
|
tipc: introduce new TIPC server infrastructure
TIPC has two internal servers, one providing a subscription
service for topology events, and another providing the
configuration interface. These servers have previously been running
in BH context, accessing the TIPC-port (aka native) API directly.
Apart from these servers, even the TIPC socket implementation is
partially built on this API.
As this API may simultaneously be called via different paths and in
different contexts, a complex and costly lock policiy is required
in order to protect TIPC internal resources.
To eliminate the need for this complex lock policiy, we introduce
a new, generic service API that uses kernel sockets for message
passing instead of the native API. Once the toplogy and configuration
servers are converted to use this new service, all code pertaining
to the native API can be removed. This entails a significant
reduction in code amount and complexity, and opens up for a complete
rework of the locking policy in TIPC.
The new service also solves another problem:
As the current topology server works in BH context, it cannot easily
be blocked when sending of events fails due to congestion. In such
cases events may have to be silently dropped, something that is
unacceptable. Therefore, the new service keeps a dedicated outbound
queue receiving messages from BH context. Once messages are
inserted into this queue, we will immediately schedule a work from a
special workqueue. This way, messages/events from the topology server
are in reality sent in process context, and the server can block
if necessary.
Analogously, there is a new workqueue for receiving messages. Once a
notification about an arriving message is received in BH context, we
schedule a work from the receive workqueue to do the job of
receiving the message in process context.
As both sending and receive messages are now finished in processes,
subscribed events cannot be dropped any more.
As of this commit, this new server infrastructure is built, but
not actually yet called by the existing TIPC code, but since the
conversion changes required in order to use it are significant,
the addition is kept here as a separate commit.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-17 21:54:39 +07:00
|
|
|
.create = tipc_sk_create
|
2006-01-03 01:04:38 +07:00
|
|
|
};
|
|
|
|
|
|
|
|
static struct proto tipc_proto = {
|
|
|
|
.name = "TIPC",
|
|
|
|
.owner = THIS_MODULE,
|
2013-06-17 21:54:37 +07:00
|
|
|
.obj_size = sizeof(struct tipc_sock),
|
|
|
|
.sysctl_rmem = sysctl_tipc_rmem
|
2006-01-03 01:04:38 +07:00
|
|
|
};
|
|
|
|
|
|
|
|
/**
|
2006-01-18 06:38:21 +07:00
|
|
|
* tipc_socket_init - initialize TIPC socket interface
|
2007-02-09 21:25:21 +07:00
|
|
|
*
|
2006-01-03 01:04:38 +07:00
|
|
|
* Returns 0 on success, errno otherwise
|
|
|
|
*/
|
2006-01-18 06:38:21 +07:00
|
|
|
int tipc_socket_init(void)
|
2006-01-03 01:04:38 +07:00
|
|
|
{
|
|
|
|
int res;
|
|
|
|
|
2007-02-09 21:25:21 +07:00
|
|
|
res = proto_register(&tipc_proto, 1);
|
2006-01-03 01:04:38 +07:00
|
|
|
if (res) {
|
2012-06-29 11:16:37 +07:00
|
|
|
pr_err("Failed to register TIPC protocol type\n");
|
2006-01-03 01:04:38 +07:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
res = sock_register(&tipc_family_ops);
|
|
|
|
if (res) {
|
2012-06-29 11:16:37 +07:00
|
|
|
pr_err("Failed to register TIPC socket type\n");
|
2006-01-03 01:04:38 +07:00
|
|
|
proto_unregister(&tipc_proto);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
out:
|
|
|
|
return res;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
2006-01-18 06:38:21 +07:00
|
|
|
* tipc_socket_stop - stop TIPC socket interface
|
2006-01-03 01:04:38 +07:00
|
|
|
*/
|
2006-01-18 06:38:21 +07:00
|
|
|
void tipc_socket_stop(void)
|
2006-01-03 01:04:38 +07:00
|
|
|
{
|
|
|
|
sock_unregister(tipc_family_ops.family);
|
|
|
|
proto_unregister(&tipc_proto);
|
|
|
|
}
|
2014-11-20 16:29:10 +07:00
|
|
|
|
|
|
|
/* Caller should hold socket lock for the passed tipc socket. */
|
2014-11-24 17:10:29 +07:00
|
|
|
static int __tipc_nl_add_sk_con(struct sk_buff *skb, struct tipc_sock *tsk)
|
2014-11-20 16:29:10 +07:00
|
|
|
{
|
|
|
|
u32 peer_node;
|
|
|
|
u32 peer_port;
|
|
|
|
struct nlattr *nest;
|
|
|
|
|
|
|
|
peer_node = tsk_peer_node(tsk);
|
|
|
|
peer_port = tsk_peer_port(tsk);
|
|
|
|
|
2019-04-26 16:13:06 +07:00
|
|
|
nest = nla_nest_start_noflag(skb, TIPC_NLA_SOCK_CON);
|
2019-03-17 04:46:05 +07:00
|
|
|
if (!nest)
|
|
|
|
return -EMSGSIZE;
|
2014-11-20 16:29:10 +07:00
|
|
|
|
|
|
|
if (nla_put_u32(skb, TIPC_NLA_CON_NODE, peer_node))
|
|
|
|
goto msg_full;
|
|
|
|
if (nla_put_u32(skb, TIPC_NLA_CON_SOCK, peer_port))
|
|
|
|
goto msg_full;
|
|
|
|
|
|
|
|
if (tsk->conn_type != 0) {
|
|
|
|
if (nla_put_flag(skb, TIPC_NLA_CON_FLAG))
|
|
|
|
goto msg_full;
|
|
|
|
if (nla_put_u32(skb, TIPC_NLA_CON_TYPE, tsk->conn_type))
|
|
|
|
goto msg_full;
|
|
|
|
if (nla_put_u32(skb, TIPC_NLA_CON_INST, tsk->conn_instance))
|
|
|
|
goto msg_full;
|
|
|
|
}
|
|
|
|
nla_nest_end(skb, nest);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
msg_full:
|
|
|
|
nla_nest_cancel(skb, nest);
|
|
|
|
|
|
|
|
return -EMSGSIZE;
|
|
|
|
}
|
|
|
|
|
2018-03-21 20:37:43 +07:00
|
|
|
static int __tipc_nl_add_sk_info(struct sk_buff *skb, struct tipc_sock
|
|
|
|
*tsk)
|
|
|
|
{
|
|
|
|
struct net *net = sock_net(skb->sk);
|
|
|
|
struct sock *sk = &tsk->sk;
|
|
|
|
|
|
|
|
if (nla_put_u32(skb, TIPC_NLA_SOCK_REF, tsk->portid) ||
|
2018-03-23 02:42:49 +07:00
|
|
|
nla_put_u32(skb, TIPC_NLA_SOCK_ADDR, tipc_own_addr(net)))
|
2018-03-21 20:37:43 +07:00
|
|
|
return -EMSGSIZE;
|
|
|
|
|
|
|
|
if (tipc_sk_connected(sk)) {
|
|
|
|
if (__tipc_nl_add_sk_con(skb, tsk))
|
|
|
|
return -EMSGSIZE;
|
|
|
|
} else if (!list_empty(&tsk->publications)) {
|
|
|
|
if (nla_put_flag(skb, TIPC_NLA_SOCK_HAS_PUBL))
|
|
|
|
return -EMSGSIZE;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2014-11-20 16:29:10 +07:00
|
|
|
/* Caller should hold socket lock for the passed tipc socket. */
|
2014-11-24 17:10:29 +07:00
|
|
|
static int __tipc_nl_add_sk(struct sk_buff *skb, struct netlink_callback *cb,
|
|
|
|
struct tipc_sock *tsk)
|
2014-11-20 16:29:10 +07:00
|
|
|
{
|
|
|
|
struct nlattr *attrs;
|
2018-03-21 20:37:43 +07:00
|
|
|
void *hdr;
|
2014-11-20 16:29:10 +07:00
|
|
|
|
|
|
|
hdr = genlmsg_put(skb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq,
|
2015-02-09 15:50:03 +07:00
|
|
|
&tipc_genl_family, NLM_F_MULTI, TIPC_NL_SOCK_GET);
|
2014-11-20 16:29:10 +07:00
|
|
|
if (!hdr)
|
|
|
|
goto msg_cancel;
|
|
|
|
|
2019-04-26 16:13:06 +07:00
|
|
|
attrs = nla_nest_start_noflag(skb, TIPC_NLA_SOCK);
|
2014-11-20 16:29:10 +07:00
|
|
|
if (!attrs)
|
|
|
|
goto genlmsg_cancel;
|
2018-03-21 20:37:43 +07:00
|
|
|
|
|
|
|
if (__tipc_nl_add_sk_info(skb, tsk))
|
2014-11-20 16:29:10 +07:00
|
|
|
goto attr_msg_cancel;
|
|
|
|
|
|
|
|
nla_nest_end(skb, attrs);
|
|
|
|
genlmsg_end(skb, hdr);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
attr_msg_cancel:
|
|
|
|
nla_nest_cancel(skb, attrs);
|
|
|
|
genlmsg_cancel:
|
|
|
|
genlmsg_cancel(skb, hdr);
|
|
|
|
msg_cancel:
|
|
|
|
return -EMSGSIZE;
|
|
|
|
}
|
|
|
|
|
2018-03-21 20:37:44 +07:00
|
|
|
int tipc_nl_sk_walk(struct sk_buff *skb, struct netlink_callback *cb,
|
|
|
|
int (*skb_handler)(struct sk_buff *skb,
|
|
|
|
struct netlink_callback *cb,
|
|
|
|
struct tipc_sock *tsk))
|
2014-11-20 16:29:10 +07:00
|
|
|
{
|
2018-09-05 04:54:55 +07:00
|
|
|
struct rhashtable_iter *iter = (void *)cb->args[4];
|
2018-03-21 20:37:43 +07:00
|
|
|
struct tipc_sock *tsk;
|
|
|
|
int err;
|
2014-11-20 16:29:10 +07:00
|
|
|
|
2018-08-25 02:28:06 +07:00
|
|
|
rhashtable_walk_start(iter);
|
|
|
|
while ((tsk = rhashtable_walk_next(iter)) != NULL) {
|
|
|
|
if (IS_ERR(tsk)) {
|
|
|
|
err = PTR_ERR(tsk);
|
|
|
|
if (err == -EAGAIN) {
|
|
|
|
err = 0;
|
2015-01-16 18:30:40 +07:00
|
|
|
continue;
|
|
|
|
}
|
2018-08-25 02:28:06 +07:00
|
|
|
break;
|
|
|
|
}
|
2015-01-16 18:30:40 +07:00
|
|
|
|
2018-08-25 02:28:06 +07:00
|
|
|
sock_hold(&tsk->sk);
|
|
|
|
rhashtable_walk_stop(iter);
|
|
|
|
lock_sock(&tsk->sk);
|
|
|
|
err = skb_handler(skb, cb, tsk);
|
|
|
|
if (err) {
|
|
|
|
release_sock(&tsk->sk);
|
|
|
|
sock_put(&tsk->sk);
|
|
|
|
goto out;
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
}
|
2018-08-25 02:28:06 +07:00
|
|
|
release_sock(&tsk->sk);
|
|
|
|
rhashtable_walk_start(iter);
|
|
|
|
sock_put(&tsk->sk);
|
2014-11-20 16:29:10 +07:00
|
|
|
}
|
2018-08-25 02:28:06 +07:00
|
|
|
rhashtable_walk_stop(iter);
|
2015-01-16 18:30:40 +07:00
|
|
|
out:
|
2014-11-20 16:29:10 +07:00
|
|
|
return skb->len;
|
|
|
|
}
|
2018-03-21 20:37:44 +07:00
|
|
|
EXPORT_SYMBOL(tipc_nl_sk_walk);
|
|
|
|
|
2018-08-25 02:28:06 +07:00
|
|
|
int tipc_dump_start(struct netlink_callback *cb)
|
|
|
|
{
|
2018-09-05 04:54:55 +07:00
|
|
|
return __tipc_dump_start(cb, sock_net(cb->skb->sk));
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(tipc_dump_start);
|
|
|
|
|
|
|
|
int __tipc_dump_start(struct netlink_callback *cb, struct net *net)
|
|
|
|
{
|
|
|
|
/* tipc_nl_name_table_dump() uses cb->args[0...3]. */
|
|
|
|
struct rhashtable_iter *iter = (void *)cb->args[4];
|
2018-08-25 02:28:06 +07:00
|
|
|
struct tipc_net *tn = tipc_net(net);
|
|
|
|
|
|
|
|
if (!iter) {
|
|
|
|
iter = kmalloc(sizeof(*iter), GFP_KERNEL);
|
|
|
|
if (!iter)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2018-09-05 04:54:55 +07:00
|
|
|
cb->args[4] = (long)iter;
|
2018-08-25 02:28:06 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
rhashtable_walk_enter(&tn->sk_rht, iter);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
int tipc_dump_done(struct netlink_callback *cb)
|
|
|
|
{
|
2018-09-05 04:54:55 +07:00
|
|
|
struct rhashtable_iter *hti = (void *)cb->args[4];
|
2018-08-25 02:28:06 +07:00
|
|
|
|
|
|
|
rhashtable_walk_exit(hti);
|
|
|
|
kfree(hti);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(tipc_dump_done);
|
|
|
|
|
2018-04-07 08:54:52 +07:00
|
|
|
int tipc_sk_fill_sock_diag(struct sk_buff *skb, struct netlink_callback *cb,
|
|
|
|
struct tipc_sock *tsk, u32 sk_filter_state,
|
2018-03-21 20:37:44 +07:00
|
|
|
u64 (*tipc_diag_gen_cookie)(struct sock *sk))
|
|
|
|
{
|
|
|
|
struct sock *sk = &tsk->sk;
|
|
|
|
struct nlattr *attrs;
|
|
|
|
struct nlattr *stat;
|
|
|
|
|
|
|
|
/*filter response w.r.t sk_state*/
|
|
|
|
if (!(sk_filter_state & (1 << sk->sk_state)))
|
|
|
|
return 0;
|
|
|
|
|
2019-04-26 16:13:06 +07:00
|
|
|
attrs = nla_nest_start_noflag(skb, TIPC_NLA_SOCK);
|
2018-03-21 20:37:44 +07:00
|
|
|
if (!attrs)
|
|
|
|
goto msg_cancel;
|
|
|
|
|
|
|
|
if (__tipc_nl_add_sk_info(skb, tsk))
|
|
|
|
goto attr_msg_cancel;
|
|
|
|
|
|
|
|
if (nla_put_u32(skb, TIPC_NLA_SOCK_TYPE, (u32)sk->sk_type) ||
|
|
|
|
nla_put_u32(skb, TIPC_NLA_SOCK_TIPC_STATE, (u32)sk->sk_state) ||
|
|
|
|
nla_put_u32(skb, TIPC_NLA_SOCK_INO, sock_i_ino(sk)) ||
|
|
|
|
nla_put_u32(skb, TIPC_NLA_SOCK_UID,
|
2018-04-07 08:54:52 +07:00
|
|
|
from_kuid_munged(sk_user_ns(NETLINK_CB(cb->skb).sk),
|
2018-04-04 19:49:47 +07:00
|
|
|
sock_i_uid(sk))) ||
|
2018-03-21 20:37:44 +07:00
|
|
|
nla_put_u64_64bit(skb, TIPC_NLA_SOCK_COOKIE,
|
|
|
|
tipc_diag_gen_cookie(sk),
|
|
|
|
TIPC_NLA_SOCK_PAD))
|
|
|
|
goto attr_msg_cancel;
|
|
|
|
|
2019-04-26 16:13:06 +07:00
|
|
|
stat = nla_nest_start_noflag(skb, TIPC_NLA_SOCK_STAT);
|
2018-03-21 20:37:44 +07:00
|
|
|
if (!stat)
|
|
|
|
goto attr_msg_cancel;
|
|
|
|
|
|
|
|
if (nla_put_u32(skb, TIPC_NLA_SOCK_STAT_RCVQ,
|
|
|
|
skb_queue_len(&sk->sk_receive_queue)) ||
|
|
|
|
nla_put_u32(skb, TIPC_NLA_SOCK_STAT_SENDQ,
|
2018-03-21 20:37:45 +07:00
|
|
|
skb_queue_len(&sk->sk_write_queue)) ||
|
|
|
|
nla_put_u32(skb, TIPC_NLA_SOCK_STAT_DROP,
|
|
|
|
atomic_read(&sk->sk_drops)))
|
2018-03-21 20:37:44 +07:00
|
|
|
goto stat_msg_cancel;
|
|
|
|
|
|
|
|
if (tsk->cong_link_cnt &&
|
|
|
|
nla_put_flag(skb, TIPC_NLA_SOCK_STAT_LINK_CONG))
|
|
|
|
goto stat_msg_cancel;
|
|
|
|
|
|
|
|
if (tsk_conn_cong(tsk) &&
|
|
|
|
nla_put_flag(skb, TIPC_NLA_SOCK_STAT_CONN_CONG))
|
|
|
|
goto stat_msg_cancel;
|
|
|
|
|
|
|
|
nla_nest_end(skb, stat);
|
2018-06-29 18:26:18 +07:00
|
|
|
|
|
|
|
if (tsk->group)
|
|
|
|
if (tipc_group_fill_sock_diag(tsk->group, skb))
|
|
|
|
goto stat_msg_cancel;
|
|
|
|
|
2018-03-21 20:37:44 +07:00
|
|
|
nla_nest_end(skb, attrs);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
stat_msg_cancel:
|
|
|
|
nla_nest_cancel(skb, stat);
|
|
|
|
attr_msg_cancel:
|
|
|
|
nla_nest_cancel(skb, attrs);
|
|
|
|
msg_cancel:
|
|
|
|
return -EMSGSIZE;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(tipc_sk_fill_sock_diag);
|
2014-11-20 16:29:11 +07:00
|
|
|
|
2018-03-21 20:37:43 +07:00
|
|
|
int tipc_nl_sk_dump(struct sk_buff *skb, struct netlink_callback *cb)
|
|
|
|
{
|
2018-03-21 20:37:44 +07:00
|
|
|
return tipc_nl_sk_walk(skb, cb, __tipc_nl_add_sk);
|
2018-03-21 20:37:43 +07:00
|
|
|
}
|
|
|
|
|
2014-11-20 16:29:11 +07:00
|
|
|
/* Caller should hold socket lock for the passed tipc socket. */
|
2014-11-24 17:10:29 +07:00
|
|
|
static int __tipc_nl_add_sk_publ(struct sk_buff *skb,
|
|
|
|
struct netlink_callback *cb,
|
|
|
|
struct publication *publ)
|
2014-11-20 16:29:11 +07:00
|
|
|
{
|
|
|
|
void *hdr;
|
|
|
|
struct nlattr *attrs;
|
|
|
|
|
|
|
|
hdr = genlmsg_put(skb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq,
|
2015-02-09 15:50:03 +07:00
|
|
|
&tipc_genl_family, NLM_F_MULTI, TIPC_NL_PUBL_GET);
|
2014-11-20 16:29:11 +07:00
|
|
|
if (!hdr)
|
|
|
|
goto msg_cancel;
|
|
|
|
|
2019-04-26 16:13:06 +07:00
|
|
|
attrs = nla_nest_start_noflag(skb, TIPC_NLA_PUBL);
|
2014-11-20 16:29:11 +07:00
|
|
|
if (!attrs)
|
|
|
|
goto genlmsg_cancel;
|
|
|
|
|
|
|
|
if (nla_put_u32(skb, TIPC_NLA_PUBL_KEY, publ->key))
|
|
|
|
goto attr_msg_cancel;
|
|
|
|
if (nla_put_u32(skb, TIPC_NLA_PUBL_TYPE, publ->type))
|
|
|
|
goto attr_msg_cancel;
|
|
|
|
if (nla_put_u32(skb, TIPC_NLA_PUBL_LOWER, publ->lower))
|
|
|
|
goto attr_msg_cancel;
|
|
|
|
if (nla_put_u32(skb, TIPC_NLA_PUBL_UPPER, publ->upper))
|
|
|
|
goto attr_msg_cancel;
|
|
|
|
|
|
|
|
nla_nest_end(skb, attrs);
|
|
|
|
genlmsg_end(skb, hdr);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
attr_msg_cancel:
|
|
|
|
nla_nest_cancel(skb, attrs);
|
|
|
|
genlmsg_cancel:
|
|
|
|
genlmsg_cancel(skb, hdr);
|
|
|
|
msg_cancel:
|
|
|
|
return -EMSGSIZE;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Caller should hold socket lock for the passed tipc socket. */
|
2014-11-24 17:10:29 +07:00
|
|
|
static int __tipc_nl_list_sk_publ(struct sk_buff *skb,
|
|
|
|
struct netlink_callback *cb,
|
|
|
|
struct tipc_sock *tsk, u32 *last_publ)
|
2014-11-20 16:29:11 +07:00
|
|
|
{
|
|
|
|
int err;
|
|
|
|
struct publication *p;
|
|
|
|
|
|
|
|
if (*last_publ) {
|
2018-03-15 22:48:55 +07:00
|
|
|
list_for_each_entry(p, &tsk->publications, binding_sock) {
|
2014-11-20 16:29:11 +07:00
|
|
|
if (p->key == *last_publ)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (p->key != *last_publ) {
|
|
|
|
/* We never set seq or call nl_dump_check_consistent()
|
|
|
|
* this means that setting prev_seq here will cause the
|
|
|
|
* consistence check to fail in the netlink callback
|
|
|
|
* handler. Resulting in the last NLMSG_DONE message
|
|
|
|
* having the NLM_F_DUMP_INTR flag set.
|
|
|
|
*/
|
|
|
|
cb->prev_seq = 1;
|
|
|
|
*last_publ = 0;
|
|
|
|
return -EPIPE;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
p = list_first_entry(&tsk->publications, struct publication,
|
2018-03-15 22:48:55 +07:00
|
|
|
binding_sock);
|
2014-11-20 16:29:11 +07:00
|
|
|
}
|
|
|
|
|
2018-03-15 22:48:55 +07:00
|
|
|
list_for_each_entry_from(p, &tsk->publications, binding_sock) {
|
2014-11-20 16:29:11 +07:00
|
|
|
err = __tipc_nl_add_sk_publ(skb, cb, p);
|
|
|
|
if (err) {
|
|
|
|
*last_publ = p->key;
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
*last_publ = 0;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
int tipc_nl_publ_dump(struct sk_buff *skb, struct netlink_callback *cb)
|
|
|
|
{
|
|
|
|
int err;
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
u32 tsk_portid = cb->args[0];
|
2014-11-20 16:29:11 +07:00
|
|
|
u32 last_publ = cb->args[1];
|
|
|
|
u32 done = cb->args[2];
|
2015-01-09 14:27:08 +07:00
|
|
|
struct net *net = sock_net(skb->sk);
|
2014-11-20 16:29:11 +07:00
|
|
|
struct tipc_sock *tsk;
|
|
|
|
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
if (!tsk_portid) {
|
2019-10-06 01:04:39 +07:00
|
|
|
struct nlattr **attrs = genl_dumpit_info(cb)->attrs;
|
2014-11-20 16:29:11 +07:00
|
|
|
struct nlattr *sock[TIPC_NLA_SOCK_MAX + 1];
|
|
|
|
|
2016-05-16 16:14:54 +07:00
|
|
|
if (!attrs[TIPC_NLA_SOCK])
|
|
|
|
return -EINVAL;
|
|
|
|
|
netlink: make validation more configurable for future strictness
We currently have two levels of strict validation:
1) liberal (default)
- undefined (type >= max) & NLA_UNSPEC attributes accepted
- attribute length >= expected accepted
- garbage at end of message accepted
2) strict (opt-in)
- NLA_UNSPEC attributes accepted
- attribute length >= expected accepted
Split out parsing strictness into four different options:
* TRAILING - check that there's no trailing data after parsing
attributes (in message or nested)
* MAXTYPE - reject attrs > max known type
* UNSPEC - reject attributes with NLA_UNSPEC policy entries
* STRICT_ATTRS - strictly validate attribute size
The default for future things should be *everything*.
The current *_strict() is a combination of TRAILING and MAXTYPE,
and is renamed to _deprecated_strict().
The current regular parsing has none of this, and is renamed to
*_parse_deprecated().
Additionally it allows us to selectively set one of the new flags
even on old policies. Notably, the UNSPEC flag could be useful in
this case, since it can be arranged (by filling in the policy) to
not be an incompatible userspace ABI change, but would then going
forward prevent forgetting attribute entries. Similar can apply
to the POLICY flag.
We end up with the following renames:
* nla_parse -> nla_parse_deprecated
* nla_parse_strict -> nla_parse_deprecated_strict
* nlmsg_parse -> nlmsg_parse_deprecated
* nlmsg_parse_strict -> nlmsg_parse_deprecated_strict
* nla_parse_nested -> nla_parse_nested_deprecated
* nla_validate_nested -> nla_validate_nested_deprecated
Using spatch, of course:
@@
expression TB, MAX, HEAD, LEN, POL, EXT;
@@
-nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
+nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)
@@
expression NLH, HDRLEN, TB, MAX, POL, EXT;
@@
-nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
+nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)
@@
expression NLH, HDRLEN, TB, MAX, POL, EXT;
@@
-nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
+nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
@@
expression TB, MAX, NLA, POL, EXT;
@@
-nla_parse_nested(TB, MAX, NLA, POL, EXT)
+nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)
@@
expression START, MAX, POL, EXT;
@@
-nla_validate_nested(START, MAX, POL, EXT)
+nla_validate_nested_deprecated(START, MAX, POL, EXT)
@@
expression NLH, HDRLEN, MAX, POL, EXT;
@@
-nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
+nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)
For this patch, don't actually add the strict, non-renamed versions
yet so that it breaks compile if I get it wrong.
Also, while at it, make nla_validate and nla_parse go down to a
common __nla_validate_parse() function to avoid code duplication.
Ultimately, this allows us to have very strict validation for every
new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
next patch, while existing things will continue to work as is.
In effect then, this adds fully strict validation for any new command.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-26 19:07:28 +07:00
|
|
|
err = nla_parse_nested_deprecated(sock, TIPC_NLA_SOCK_MAX,
|
|
|
|
attrs[TIPC_NLA_SOCK],
|
|
|
|
tipc_nl_sock_policy, NULL);
|
2014-11-20 16:29:11 +07:00
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
|
|
|
if (!sock[TIPC_NLA_SOCK_REF])
|
|
|
|
return -EINVAL;
|
|
|
|
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
tsk_portid = nla_get_u32(sock[TIPC_NLA_SOCK_REF]);
|
2014-11-20 16:29:11 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
if (done)
|
|
|
|
return 0;
|
|
|
|
|
2015-01-09 14:27:08 +07:00
|
|
|
tsk = tipc_sk_lookup(net, tsk_portid);
|
2014-11-20 16:29:11 +07:00
|
|
|
if (!tsk)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
lock_sock(&tsk->sk);
|
|
|
|
err = __tipc_nl_list_sk_publ(skb, cb, tsk, &last_publ);
|
|
|
|
if (!err)
|
|
|
|
done = 1;
|
|
|
|
release_sock(&tsk->sk);
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
sock_put(&tsk->sk);
|
2014-11-20 16:29:11 +07:00
|
|
|
|
tipc: convert tipc reference table to use generic rhashtable
As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.
If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.
Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07 12:41:58 +07:00
|
|
|
cb->args[0] = tsk_portid;
|
2014-11-20 16:29:11 +07:00
|
|
|
cb->args[1] = last_publ;
|
|
|
|
cb->args[2] = done;
|
|
|
|
|
|
|
|
return skb->len;
|
|
|
|
}
|
tipc: enable tracepoints in tipc
As for the sake of debugging/tracing, the commit enables tracepoints in
TIPC along with some general trace_events as shown below. It also
defines some 'tipc_*_dump()' functions that allow to dump TIPC object
data whenever needed, that is, for general debug purposes, ie. not just
for the trace_events.
The following trace_events are now available:
- trace_tipc_skb_dump(): allows to trace and dump TIPC msg & skb data,
e.g. message type, user, droppable, skb truesize, cloned skb, etc.
- trace_tipc_list_dump(): allows to trace and dump any TIPC buffers or
queues, e.g. TIPC link transmq, socket receive queue, etc.
- trace_tipc_sk_dump(): allows to trace and dump TIPC socket data, e.g.
sk state, sk type, connection type, rmem_alloc, socket queues, etc.
- trace_tipc_link_dump(): allows to trace and dump TIPC link data, e.g.
link state, silent_intv_cnt, gap, bc_gap, link queues, etc.
- trace_tipc_node_dump(): allows to trace and dump TIPC node data, e.g.
node state, active links, capabilities, link entries, etc.
How to use:
Put the trace functions at any places where we want to dump TIPC data
or events.
Note:
a) The dump functions will generate raw data only, that is, to offload
the trace event's processing, it can require a tool or script to parse
the data but this should be simple.
b) The trace_tipc_*_dump() should be reserved for a failure cases only
(e.g. the retransmission failure case) or where we do not expect to
happen too often, then we can consider enabling these events by default
since they will almost not take any effects under normal conditions,
but once the rare condition or failure occurs, we get the dumped data
fully for post-analysis.
For other trace purposes, we can reuse these trace classes as template
but different events.
c) A trace_event is only effective when we enable it. To enable the
TIPC trace_events, echo 1 to 'enable' files in the events/tipc/
directory in the 'debugfs' file system. Normally, they are located at:
/sys/kernel/debug/tracing/events/tipc/
For example:
To enable the tipc_link_dump event:
echo 1 > /sys/kernel/debug/tracing/events/tipc/tipc_link_dump/enable
To enable all the TIPC trace_events:
echo 1 > /sys/kernel/debug/tracing/events/tipc/enable
To collect the trace data:
cat trace
or
cat trace_pipe > /trace.out &
To disable all the TIPC trace_events:
echo 0 > /sys/kernel/debug/tracing/events/tipc/enable
To clear the trace buffer:
echo > trace
d) Like the other trace_events, the feature like 'filter' or 'trigger'
is also usable for the tipc trace_events.
For more details, have a look at:
Documentation/trace/ftrace.txt
MAINTAINERS | add two new files 'trace.h' & 'trace.c' in tipc
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 09:17:56 +07:00
|
|
|
|
tipc: add trace_events for tipc socket
The commit adds the new trace_events for TIPC socket object:
trace_tipc_sk_create()
trace_tipc_sk_poll()
trace_tipc_sk_sendmsg()
trace_tipc_sk_sendmcast()
trace_tipc_sk_sendstream()
trace_tipc_sk_filter_rcv()
trace_tipc_sk_advance_rx()
trace_tipc_sk_rej_msg()
trace_tipc_sk_drop_msg()
trace_tipc_sk_release()
trace_tipc_sk_shutdown()
trace_tipc_sk_overlimit1()
trace_tipc_sk_overlimit2()
Also, enables the traces for the following cases:
- When user creates a TIPC socket;
- When user calls poll() on TIPC socket;
- When user sends a dgram/mcast/stream message.
- When a message is put into the socket 'sk_receive_queue';
- When a message is released from the socket 'sk_receive_queue';
- When a message is rejected (e.g. due to no port, invalid, etc.);
- When a message is dropped (e.g. due to wrong message type);
- When socket is released;
- When socket is shutdown;
- When socket rcvq's allocation is overlimit (> 90%);
- When socket rcvq + bklq's allocation is overlimit (> 90%);
- When the 'TIPC_ERR_OVERLOAD/2' issue happens;
Note:
a) All the socket traces are designed to be able to trace on a specific
socket by either using the 'event filtering' feature on a known socket
'portid' value or the sysctl file:
/proc/sys/net/tipc/sk_filter
The file determines a 'tuple' for what socket should be traced:
(portid, sock type, name type, name lower, name upper)
where:
+ 'portid' is the socket portid generated at socket creating, can be
found in the trace outputs or the 'tipc socket list' command printouts;
+ 'sock type' is the socket type (1 = SOCK_TREAM, ...);
+ 'name type', 'name lower' and 'name upper' are the service name being
connected to or published by the socket.
Value '0' means 'ANY', the default tuple value is (0, 0, 0, 0, 0) i.e.
the traces happen for every sockets with no filter.
b) The 'tipc_sk_overlimit1/2' event is also a conditional trace_event
which happens when the socket receive queue (and backlog queue) is
about to be overloaded, when the queue allocation is > 90%. Then, when
the trace is enabled, the last skbs leading to the TIPC_ERR_OVERLOAD/2
issue can be traced.
The trace event is designed as an 'upper watermark' notification that
the other traces (e.g. 'tipc_sk_advance_rx' vs 'tipc_sk_filter_rcv') or
actions can be triggerred in the meanwhile to see what is going on with
the socket queue.
In addition, the 'trace_tipc_sk_dump()' is also placed at the
'TIPC_ERR_OVERLOAD/2' case, so the socket and last skb can be dumped
for post-analysis.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 09:17:58 +07:00
|
|
|
/**
|
|
|
|
* tipc_sk_filtering - check if a socket should be traced
|
|
|
|
* @sk: the socket to be examined
|
|
|
|
* @sysctl_tipc_sk_filter[]: the socket tuple for filtering,
|
|
|
|
* (portid, sock type, name type, name lower, name upper)
|
|
|
|
*
|
|
|
|
* Returns true if the socket meets the socket tuple data
|
|
|
|
* (value 0 = 'any') or when there is no tuple set (all = 0),
|
|
|
|
* otherwise false
|
|
|
|
*/
|
|
|
|
bool tipc_sk_filtering(struct sock *sk)
|
|
|
|
{
|
|
|
|
struct tipc_sock *tsk;
|
|
|
|
struct publication *p;
|
|
|
|
u32 _port, _sktype, _type, _lower, _upper;
|
|
|
|
u32 type = 0, lower = 0, upper = 0;
|
|
|
|
|
|
|
|
if (!sk)
|
|
|
|
return true;
|
|
|
|
|
|
|
|
tsk = tipc_sk(sk);
|
|
|
|
|
|
|
|
_port = sysctl_tipc_sk_filter[0];
|
|
|
|
_sktype = sysctl_tipc_sk_filter[1];
|
|
|
|
_type = sysctl_tipc_sk_filter[2];
|
|
|
|
_lower = sysctl_tipc_sk_filter[3];
|
|
|
|
_upper = sysctl_tipc_sk_filter[4];
|
|
|
|
|
|
|
|
if (!_port && !_sktype && !_type && !_lower && !_upper)
|
|
|
|
return true;
|
|
|
|
|
|
|
|
if (_port)
|
|
|
|
return (_port == tsk->portid);
|
|
|
|
|
|
|
|
if (_sktype && _sktype != sk->sk_type)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
if (tsk->published) {
|
|
|
|
p = list_first_entry_or_null(&tsk->publications,
|
|
|
|
struct publication, binding_sock);
|
|
|
|
if (p) {
|
|
|
|
type = p->type;
|
|
|
|
lower = p->lower;
|
|
|
|
upper = p->upper;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!tipc_sk_type_connectionless(sk)) {
|
|
|
|
type = tsk->conn_type;
|
|
|
|
lower = tsk->conn_instance;
|
|
|
|
upper = tsk->conn_instance;
|
|
|
|
}
|
|
|
|
|
|
|
|
if ((_type && _type != type) || (_lower && _lower != lower) ||
|
|
|
|
(_upper && _upper != upper))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
tipc: enable tracepoints in tipc
As for the sake of debugging/tracing, the commit enables tracepoints in
TIPC along with some general trace_events as shown below. It also
defines some 'tipc_*_dump()' functions that allow to dump TIPC object
data whenever needed, that is, for general debug purposes, ie. not just
for the trace_events.
The following trace_events are now available:
- trace_tipc_skb_dump(): allows to trace and dump TIPC msg & skb data,
e.g. message type, user, droppable, skb truesize, cloned skb, etc.
- trace_tipc_list_dump(): allows to trace and dump any TIPC buffers or
queues, e.g. TIPC link transmq, socket receive queue, etc.
- trace_tipc_sk_dump(): allows to trace and dump TIPC socket data, e.g.
sk state, sk type, connection type, rmem_alloc, socket queues, etc.
- trace_tipc_link_dump(): allows to trace and dump TIPC link data, e.g.
link state, silent_intv_cnt, gap, bc_gap, link queues, etc.
- trace_tipc_node_dump(): allows to trace and dump TIPC node data, e.g.
node state, active links, capabilities, link entries, etc.
How to use:
Put the trace functions at any places where we want to dump TIPC data
or events.
Note:
a) The dump functions will generate raw data only, that is, to offload
the trace event's processing, it can require a tool or script to parse
the data but this should be simple.
b) The trace_tipc_*_dump() should be reserved for a failure cases only
(e.g. the retransmission failure case) or where we do not expect to
happen too often, then we can consider enabling these events by default
since they will almost not take any effects under normal conditions,
but once the rare condition or failure occurs, we get the dumped data
fully for post-analysis.
For other trace purposes, we can reuse these trace classes as template
but different events.
c) A trace_event is only effective when we enable it. To enable the
TIPC trace_events, echo 1 to 'enable' files in the events/tipc/
directory in the 'debugfs' file system. Normally, they are located at:
/sys/kernel/debug/tracing/events/tipc/
For example:
To enable the tipc_link_dump event:
echo 1 > /sys/kernel/debug/tracing/events/tipc/tipc_link_dump/enable
To enable all the TIPC trace_events:
echo 1 > /sys/kernel/debug/tracing/events/tipc/enable
To collect the trace data:
cat trace
or
cat trace_pipe > /trace.out &
To disable all the TIPC trace_events:
echo 0 > /sys/kernel/debug/tracing/events/tipc/enable
To clear the trace buffer:
echo > trace
d) Like the other trace_events, the feature like 'filter' or 'trigger'
is also usable for the tipc trace_events.
For more details, have a look at:
Documentation/trace/ftrace.txt
MAINTAINERS | add two new files 'trace.h' & 'trace.c' in tipc
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 09:17:56 +07:00
|
|
|
u32 tipc_sock_get_portid(struct sock *sk)
|
|
|
|
{
|
|
|
|
return (sk) ? (tipc_sk(sk))->portid : 0;
|
|
|
|
}
|
|
|
|
|
tipc: add trace_events for tipc socket
The commit adds the new trace_events for TIPC socket object:
trace_tipc_sk_create()
trace_tipc_sk_poll()
trace_tipc_sk_sendmsg()
trace_tipc_sk_sendmcast()
trace_tipc_sk_sendstream()
trace_tipc_sk_filter_rcv()
trace_tipc_sk_advance_rx()
trace_tipc_sk_rej_msg()
trace_tipc_sk_drop_msg()
trace_tipc_sk_release()
trace_tipc_sk_shutdown()
trace_tipc_sk_overlimit1()
trace_tipc_sk_overlimit2()
Also, enables the traces for the following cases:
- When user creates a TIPC socket;
- When user calls poll() on TIPC socket;
- When user sends a dgram/mcast/stream message.
- When a message is put into the socket 'sk_receive_queue';
- When a message is released from the socket 'sk_receive_queue';
- When a message is rejected (e.g. due to no port, invalid, etc.);
- When a message is dropped (e.g. due to wrong message type);
- When socket is released;
- When socket is shutdown;
- When socket rcvq's allocation is overlimit (> 90%);
- When socket rcvq + bklq's allocation is overlimit (> 90%);
- When the 'TIPC_ERR_OVERLOAD/2' issue happens;
Note:
a) All the socket traces are designed to be able to trace on a specific
socket by either using the 'event filtering' feature on a known socket
'portid' value or the sysctl file:
/proc/sys/net/tipc/sk_filter
The file determines a 'tuple' for what socket should be traced:
(portid, sock type, name type, name lower, name upper)
where:
+ 'portid' is the socket portid generated at socket creating, can be
found in the trace outputs or the 'tipc socket list' command printouts;
+ 'sock type' is the socket type (1 = SOCK_TREAM, ...);
+ 'name type', 'name lower' and 'name upper' are the service name being
connected to or published by the socket.
Value '0' means 'ANY', the default tuple value is (0, 0, 0, 0, 0) i.e.
the traces happen for every sockets with no filter.
b) The 'tipc_sk_overlimit1/2' event is also a conditional trace_event
which happens when the socket receive queue (and backlog queue) is
about to be overloaded, when the queue allocation is > 90%. Then, when
the trace is enabled, the last skbs leading to the TIPC_ERR_OVERLOAD/2
issue can be traced.
The trace event is designed as an 'upper watermark' notification that
the other traces (e.g. 'tipc_sk_advance_rx' vs 'tipc_sk_filter_rcv') or
actions can be triggerred in the meanwhile to see what is going on with
the socket queue.
In addition, the 'trace_tipc_sk_dump()' is also placed at the
'TIPC_ERR_OVERLOAD/2' case, so the socket and last skb can be dumped
for post-analysis.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 09:17:58 +07:00
|
|
|
/**
|
|
|
|
* tipc_sk_overlimit1 - check if socket rx queue is about to be overloaded,
|
|
|
|
* both the rcv and backlog queues are considered
|
|
|
|
* @sk: tipc sk to be checked
|
|
|
|
* @skb: tipc msg to be checked
|
|
|
|
*
|
|
|
|
* Returns true if the socket rx queue allocation is > 90%, otherwise false
|
|
|
|
*/
|
|
|
|
|
|
|
|
bool tipc_sk_overlimit1(struct sock *sk, struct sk_buff *skb)
|
|
|
|
{
|
|
|
|
atomic_t *dcnt = &tipc_sk(sk)->dupl_rcvcnt;
|
|
|
|
unsigned int lim = rcvbuf_limit(sk, skb) + atomic_read(dcnt);
|
|
|
|
unsigned int qsize = sk->sk_backlog.len + sk_rmem_alloc_get(sk);
|
|
|
|
|
|
|
|
return (qsize > lim * 90 / 100);
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* tipc_sk_overlimit2 - check if socket rx queue is about to be overloaded,
|
|
|
|
* only the rcv queue is considered
|
|
|
|
* @sk: tipc sk to be checked
|
|
|
|
* @skb: tipc msg to be checked
|
|
|
|
*
|
|
|
|
* Returns true if the socket rx queue allocation is > 90%, otherwise false
|
|
|
|
*/
|
|
|
|
|
|
|
|
bool tipc_sk_overlimit2(struct sock *sk, struct sk_buff *skb)
|
|
|
|
{
|
|
|
|
unsigned int lim = rcvbuf_limit(sk, skb);
|
|
|
|
unsigned int qsize = sk_rmem_alloc_get(sk);
|
|
|
|
|
|
|
|
return (qsize > lim * 90 / 100);
|
|
|
|
}
|
|
|
|
|
tipc: enable tracepoints in tipc
As for the sake of debugging/tracing, the commit enables tracepoints in
TIPC along with some general trace_events as shown below. It also
defines some 'tipc_*_dump()' functions that allow to dump TIPC object
data whenever needed, that is, for general debug purposes, ie. not just
for the trace_events.
The following trace_events are now available:
- trace_tipc_skb_dump(): allows to trace and dump TIPC msg & skb data,
e.g. message type, user, droppable, skb truesize, cloned skb, etc.
- trace_tipc_list_dump(): allows to trace and dump any TIPC buffers or
queues, e.g. TIPC link transmq, socket receive queue, etc.
- trace_tipc_sk_dump(): allows to trace and dump TIPC socket data, e.g.
sk state, sk type, connection type, rmem_alloc, socket queues, etc.
- trace_tipc_link_dump(): allows to trace and dump TIPC link data, e.g.
link state, silent_intv_cnt, gap, bc_gap, link queues, etc.
- trace_tipc_node_dump(): allows to trace and dump TIPC node data, e.g.
node state, active links, capabilities, link entries, etc.
How to use:
Put the trace functions at any places where we want to dump TIPC data
or events.
Note:
a) The dump functions will generate raw data only, that is, to offload
the trace event's processing, it can require a tool or script to parse
the data but this should be simple.
b) The trace_tipc_*_dump() should be reserved for a failure cases only
(e.g. the retransmission failure case) or where we do not expect to
happen too often, then we can consider enabling these events by default
since they will almost not take any effects under normal conditions,
but once the rare condition or failure occurs, we get the dumped data
fully for post-analysis.
For other trace purposes, we can reuse these trace classes as template
but different events.
c) A trace_event is only effective when we enable it. To enable the
TIPC trace_events, echo 1 to 'enable' files in the events/tipc/
directory in the 'debugfs' file system. Normally, they are located at:
/sys/kernel/debug/tracing/events/tipc/
For example:
To enable the tipc_link_dump event:
echo 1 > /sys/kernel/debug/tracing/events/tipc/tipc_link_dump/enable
To enable all the TIPC trace_events:
echo 1 > /sys/kernel/debug/tracing/events/tipc/enable
To collect the trace data:
cat trace
or
cat trace_pipe > /trace.out &
To disable all the TIPC trace_events:
echo 0 > /sys/kernel/debug/tracing/events/tipc/enable
To clear the trace buffer:
echo > trace
d) Like the other trace_events, the feature like 'filter' or 'trigger'
is also usable for the tipc trace_events.
For more details, have a look at:
Documentation/trace/ftrace.txt
MAINTAINERS | add two new files 'trace.h' & 'trace.c' in tipc
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 09:17:56 +07:00
|
|
|
/**
|
|
|
|
* tipc_sk_dump - dump TIPC socket
|
|
|
|
* @sk: tipc sk to be dumped
|
|
|
|
* @dqueues: bitmask to decide if any socket queue to be dumped?
|
|
|
|
* - TIPC_DUMP_NONE: don't dump socket queues
|
|
|
|
* - TIPC_DUMP_SK_SNDQ: dump socket send queue
|
|
|
|
* - TIPC_DUMP_SK_RCVQ: dump socket rcv queue
|
|
|
|
* - TIPC_DUMP_SK_BKLGQ: dump socket backlog queue
|
|
|
|
* - TIPC_DUMP_ALL: dump all the socket queues above
|
|
|
|
* @buf: returned buffer of dump data in format
|
|
|
|
*/
|
|
|
|
int tipc_sk_dump(struct sock *sk, u16 dqueues, char *buf)
|
|
|
|
{
|
|
|
|
int i = 0;
|
|
|
|
size_t sz = (dqueues) ? SK_LMAX : SK_LMIN;
|
|
|
|
struct tipc_sock *tsk;
|
|
|
|
struct publication *p;
|
|
|
|
bool tsk_connected;
|
|
|
|
|
|
|
|
if (!sk) {
|
|
|
|
i += scnprintf(buf, sz, "sk data: (null)\n");
|
|
|
|
return i;
|
|
|
|
}
|
|
|
|
|
|
|
|
tsk = tipc_sk(sk);
|
|
|
|
tsk_connected = !tipc_sk_type_connectionless(sk);
|
|
|
|
|
|
|
|
i += scnprintf(buf, sz, "sk data: %u", sk->sk_type);
|
|
|
|
i += scnprintf(buf + i, sz - i, " %d", sk->sk_state);
|
|
|
|
i += scnprintf(buf + i, sz - i, " %x", tsk_own_node(tsk));
|
|
|
|
i += scnprintf(buf + i, sz - i, " %u", tsk->portid);
|
|
|
|
i += scnprintf(buf + i, sz - i, " | %u", tsk_connected);
|
|
|
|
if (tsk_connected) {
|
|
|
|
i += scnprintf(buf + i, sz - i, " %x", tsk_peer_node(tsk));
|
|
|
|
i += scnprintf(buf + i, sz - i, " %u", tsk_peer_port(tsk));
|
|
|
|
i += scnprintf(buf + i, sz - i, " %u", tsk->conn_type);
|
|
|
|
i += scnprintf(buf + i, sz - i, " %u", tsk->conn_instance);
|
|
|
|
}
|
|
|
|
i += scnprintf(buf + i, sz - i, " | %u", tsk->published);
|
|
|
|
if (tsk->published) {
|
|
|
|
p = list_first_entry_or_null(&tsk->publications,
|
|
|
|
struct publication, binding_sock);
|
|
|
|
i += scnprintf(buf + i, sz - i, " %u", (p) ? p->type : 0);
|
|
|
|
i += scnprintf(buf + i, sz - i, " %u", (p) ? p->lower : 0);
|
|
|
|
i += scnprintf(buf + i, sz - i, " %u", (p) ? p->upper : 0);
|
|
|
|
}
|
|
|
|
i += scnprintf(buf + i, sz - i, " | %u", tsk->snd_win);
|
|
|
|
i += scnprintf(buf + i, sz - i, " %u", tsk->rcv_win);
|
|
|
|
i += scnprintf(buf + i, sz - i, " %u", tsk->max_pkt);
|
|
|
|
i += scnprintf(buf + i, sz - i, " %x", tsk->peer_caps);
|
|
|
|
i += scnprintf(buf + i, sz - i, " %u", tsk->cong_link_cnt);
|
|
|
|
i += scnprintf(buf + i, sz - i, " %u", tsk->snt_unacked);
|
|
|
|
i += scnprintf(buf + i, sz - i, " %u", tsk->rcv_unacked);
|
|
|
|
i += scnprintf(buf + i, sz - i, " %u", atomic_read(&tsk->dupl_rcvcnt));
|
|
|
|
i += scnprintf(buf + i, sz - i, " %u", sk->sk_shutdown);
|
|
|
|
i += scnprintf(buf + i, sz - i, " | %d", sk_wmem_alloc_get(sk));
|
|
|
|
i += scnprintf(buf + i, sz - i, " %d", sk->sk_sndbuf);
|
|
|
|
i += scnprintf(buf + i, sz - i, " | %d", sk_rmem_alloc_get(sk));
|
|
|
|
i += scnprintf(buf + i, sz - i, " %d", sk->sk_rcvbuf);
|
2019-10-10 05:41:03 +07:00
|
|
|
i += scnprintf(buf + i, sz - i, " | %d\n", READ_ONCE(sk->sk_backlog.len));
|
tipc: enable tracepoints in tipc
As for the sake of debugging/tracing, the commit enables tracepoints in
TIPC along with some general trace_events as shown below. It also
defines some 'tipc_*_dump()' functions that allow to dump TIPC object
data whenever needed, that is, for general debug purposes, ie. not just
for the trace_events.
The following trace_events are now available:
- trace_tipc_skb_dump(): allows to trace and dump TIPC msg & skb data,
e.g. message type, user, droppable, skb truesize, cloned skb, etc.
- trace_tipc_list_dump(): allows to trace and dump any TIPC buffers or
queues, e.g. TIPC link transmq, socket receive queue, etc.
- trace_tipc_sk_dump(): allows to trace and dump TIPC socket data, e.g.
sk state, sk type, connection type, rmem_alloc, socket queues, etc.
- trace_tipc_link_dump(): allows to trace and dump TIPC link data, e.g.
link state, silent_intv_cnt, gap, bc_gap, link queues, etc.
- trace_tipc_node_dump(): allows to trace and dump TIPC node data, e.g.
node state, active links, capabilities, link entries, etc.
How to use:
Put the trace functions at any places where we want to dump TIPC data
or events.
Note:
a) The dump functions will generate raw data only, that is, to offload
the trace event's processing, it can require a tool or script to parse
the data but this should be simple.
b) The trace_tipc_*_dump() should be reserved for a failure cases only
(e.g. the retransmission failure case) or where we do not expect to
happen too often, then we can consider enabling these events by default
since they will almost not take any effects under normal conditions,
but once the rare condition or failure occurs, we get the dumped data
fully for post-analysis.
For other trace purposes, we can reuse these trace classes as template
but different events.
c) A trace_event is only effective when we enable it. To enable the
TIPC trace_events, echo 1 to 'enable' files in the events/tipc/
directory in the 'debugfs' file system. Normally, they are located at:
/sys/kernel/debug/tracing/events/tipc/
For example:
To enable the tipc_link_dump event:
echo 1 > /sys/kernel/debug/tracing/events/tipc/tipc_link_dump/enable
To enable all the TIPC trace_events:
echo 1 > /sys/kernel/debug/tracing/events/tipc/enable
To collect the trace data:
cat trace
or
cat trace_pipe > /trace.out &
To disable all the TIPC trace_events:
echo 0 > /sys/kernel/debug/tracing/events/tipc/enable
To clear the trace buffer:
echo > trace
d) Like the other trace_events, the feature like 'filter' or 'trigger'
is also usable for the tipc trace_events.
For more details, have a look at:
Documentation/trace/ftrace.txt
MAINTAINERS | add two new files 'trace.h' & 'trace.c' in tipc
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 09:17:56 +07:00
|
|
|
|
|
|
|
if (dqueues & TIPC_DUMP_SK_SNDQ) {
|
|
|
|
i += scnprintf(buf + i, sz - i, "sk_write_queue: ");
|
|
|
|
i += tipc_list_dump(&sk->sk_write_queue, false, buf + i);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (dqueues & TIPC_DUMP_SK_RCVQ) {
|
|
|
|
i += scnprintf(buf + i, sz - i, "sk_receive_queue: ");
|
|
|
|
i += tipc_list_dump(&sk->sk_receive_queue, false, buf + i);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (dqueues & TIPC_DUMP_SK_BKLGQ) {
|
|
|
|
i += scnprintf(buf + i, sz - i, "sk_backlog:\n head ");
|
|
|
|
i += tipc_skb_dump(sk->sk_backlog.head, false, buf + i);
|
|
|
|
if (sk->sk_backlog.tail != sk->sk_backlog.head) {
|
|
|
|
i += scnprintf(buf + i, sz - i, " tail ");
|
|
|
|
i += tipc_skb_dump(sk->sk_backlog.tail, false,
|
|
|
|
buf + i);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return i;
|
|
|
|
}
|