Replace all uses of kmem_cache_t with struct kmem_cache.
The patch was generated using the following script:
#!/bin/sh
#
# Replace one string by another in all the kernel sources.
#
set -e
for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
quilt add $file
sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
mv /tmp/$$ $file
quilt refresh
done
The script was run like this
sh replace kmem_cache_t "struct kmem_cache"
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
SLAB_ATOMIC is an alias of GFP_ATOMIC
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Conflicts:
drivers/ata/libata-scsi.c
include/linux/libata.h
Futher merge of Linus's head and compilation fixups.
Signed-Off-By: David Howells <dhowells@redhat.com>
Conflicts:
drivers/infiniband/core/iwcm.c
drivers/net/chelsio/cxgb2.c
drivers/net/wireless/bcm43xx/bcm43xx_main.c
drivers/net/wireless/prism54/islpci_eth.c
drivers/usb/core/hub.h
drivers/usb/input/hid-core.c
net/core/netpoll.c
Fix up merge failures with Linus's head and fix new compilation failures.
Signed-Off-By: David Howells <dhowells@redhat.com>
This replaces the linear search algorithm for reverse lookup with
binary search.
It has the advantage of better scalability: O(log2(N)) instead of O(N).
This means that the average number of iterations is reduced from 250
(linear search if each value appears equally likely) down to at most 9.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This patch deprecates the existing use of an arbitrary value TFRC_SMALLEST_P
for low-threshold values of p. This avoids masking low-resolution errors.
Instead, the code now checks against real boundaries (implemented by preceding
patch) and provides warnings whenever a real value falls below the threshold.
If such messages are observed, it is a better solution to take this as an
indication that the lookup table needs to be re-engineered.
Changelog:
----------
This patch
* makes handling all TFRC resolution errors local to the TFRC library
* removes unnecessary test whether X_calc is 'infinity' due to p==0 -- this
condition is already caught by tfrc_calc_x()
* removes setting ccid3hctx_p = TFRC_SMALLEST_P in ccid3_hc_tx_packet_recv
since this is now done by the TFRC library
* updates BUG_ON test in ccid3_hc_tx_no_feedback_timer to take into account
that p now is either 0 (and then X_calc is irrelevant), or it is > 0; since
the handling of TFRC_SMALLEST_P is now taken care of in the tfrc library
Justification:
--------------
The TFRC code uses a lookup table which has a bounded resolution.
The lowest possible value of the loss event rate `p' which can be
resolved is currently 0.0001. Substituting this lower threshold for
p when p is less than 0.0001 results in a huge, exponentially-growing
error. The error can be computed by the following formula:
(f(0.0001) - f(p))/f(p) * 100 for p < 0.0001
Currently the solution is to use an (arbitrary) value
TFRC_SMALLEST_P = 40 * 1E-6 = 0.00004
and to consider all values below this value as `virtually zero'. Due to
the exponentially growing resolution error, this is not a good idea, since
it hides the fact that the table can not resolve practically occurring cases.
Already at p == TFRC_SMALLEST_P, the error is as high as 58.19%!
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This
* adds documentation about the lowest resolution that is possible within
the bounds of the current lookup table
* defines a constant TFRC_SMALLEST_P which defines this resolution
* issues a warning if a given value of p is below resolution
* combines two previously adjacent if-blocks of nearly identical
structure into one
This patch does not change the algorithm as such.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
1) For the forward X_calc lookup, it
* protects effectively against RTT=0 (this case is possible), by
returning the maximal lookup value instead of just setting it to 1
* reformulates the array-bounds exceeded condition: this only happens
if p is greater than 1E6 (due to the scaling)
* the case of negative indices can now with certainty be excluded,
since documentation shows that the formulas are within bounds
* additional protection against p = 0 (would give divide-by-zero)
2) For the reverse lookup, it warns against
* protects against exceeding array bounds
* now returns 0 if f(p) = 0, due to function definition
* warns about minimal resolution error and returns the smallest table
value instead of p=0 [this would mask congestion conditions]
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This fixes the following small error in tfrc_calc_x_reverse_lookup.
1) The table is generated by the following equations:
lookup[index][0] = g((index+1) * 1000000/TFRC_CALC_X_ARRSIZE);
lookup[index][1] = g((index+1) * TFRC_CALC_X_SPLIT/TFRC_CALC_X_ARRSIZE);
where g(q) is 1E6 * f(q/1E6)
2) The reverse lookup assigns an entry in lookup[index][small]
3) This index needs to match the above, i.e.
* if small=0 then
p = (index + 1) * 1000000/TFRC_CALC_X_ARRSIZE
* if small=1 then
p = (index+1) * TFRC_CALC_X_SPLIT/TFRC_CALC_X_ARRSIZE
These are exactly the changes that the patch makes; previously the code did
not conform to the way the lookup table was generated (this difference resulted
in a mean error of about 1.12%).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This adds documentation for the TCP Reno throughput equation which is at
the heart of the TFRC sending rate / loss rate calculations.
It spells out precisely how the values were determined and what they mean.
The equations were derived through reverse engineering and found to be
fully accurate (verified using test programs).
This patch does not change any code.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This avoids a (harmless) warning message being printed at the DCCP server
(the receiver of a DCCP half connection).
Incoming packets are both directed to
* ccid_hc_rx_packet_recv() for the server half
* ccid_hc_tx_packet_recv() for the client half
The message gets printed since on a server the client half is currently not
sending data packets.
This is resolved for the moment by checking the DCCP-role first. In future
times (bidirectional DCCP connections), this test may have to be more
sophisticated.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
The main object of this patch is the following bug:
==> In ccid3_hc_tx_packet_recv, the parameters p and X_recv were updated
_after_ the send rate was calculated. This is clearly an error and is
resolved by re-ordering statements.
In addition,
* r_sample is converted from u32 to long to check whether the time difference
was negative (it would otherwise be converted to a large u32 value)
* protection against RTT=0 (this is possible) is provided in a further patch
* t_elapsed is also converted to long, to match the type of r_sample
* adds a a more debugging information regarding current send rates
* various trivial comment/documentation updates
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This bug resulted in ccid3_hc_tx_send_packet returning negative
delay values, which in turn triggered silently dequeueing packets in
dccp_write_xmit. As a result, only a few out of the submitted packets made
it at all onto the network. Occasionally, when dccp_wait_for_ccid was
involved, this also triggered a bug warning since ccid3_hc_tx_send_packet
returned a negative value (which in reality was a negative delay value).
The cause for this bug lies in the comparison
if (delay >= hctx->ccid3hctx_delta)
return delay / 1000L;
The type of `delay' is `long', that of ccid3hctx_delta is `u32'. When comparing
negative long values against u32 values, the test returned `true' whenever delay
was smaller than 0 (meaning the packet was overdue to send).
The fix is by casting, subtracting, and then testing the difference with
regard to 0.
This has been tested and shown to work.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
The TFRC nofeedback timer normally expires after the maximum of 4
RTTs and twice the current send interval (RFC 3448, 4.3). On LANs
with a small RTT this can mean a high processing load and reduced
performance, since then the nofeedback timer is triggered very
frequently.
This patch provides a configuration option to set the bound for the
nofeedback timer, using as default 100 milliseconds.
By setting the configuration option to 0, strict RFC 3448 behaviour
can be enforced for the nofeedback timer.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This patch implements a suggestion by Ian McDonald and
1) Avoids tests against negative packet lengths by using unsigned int
for packet payload lengths in the CCID send_packet()/packet_sent() routines
2) As a consequence, it removes an now unnecessary test with regard to `len > 0'
in ccid3_hc_tx_packet_sent: that condition is always true, since
* negative packet lengths are avoided
* ccid3_hc_tx_send_packet flags an error whenever the payload length is 0.
As a consequence, ccid3_hc_tx_packet_sent is never called as all errors
returned by ccid_hc_tx_send_packet are caught in dccp_write_xmit
3) Removes the third argument of ccid_hc_tx_send_packet (the `len' parameter),
since it is currently always set to skb->len. The code is updated with regard
to this parameter change.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This implements the larger-initial-windows feature for CCID 3, as described in
section 5 of RFC 4342. When the first feedback packet arrives, the sender can
send up to 2..4 packets per RTT, instead of just one.
The patch further
* reduces the number of timestamping calls by passing the timestamp value
(which is computed in one of the calling functions anyway) as argument
* renames one constant with a very long name into one which is shorter and
resembles the one in RFC 3448 (t_mbi)
* simplifies some of the min_t/max_t cases where both `x', `y' have the same
type
Commiter note: renamed TFRC_t_mbi to TFRC_T_MBI, to follow Linux coding style.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
To reflect the fact that this now is of no effect, not making apps
stop working, just be warned in the system log.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This removes and cleans up unused variables and structures which have become
unnecessary following the introduction of the EWMA patch to automatically track
the CCID 3 receiver/sender packet sizes `s'.
It deprecates the PACKET_SIZE socket option by returning an error code and
printing a deprecation warning if an application tries to read or write this
socket option.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This corrects the setting of the nofeedback timer with regard to RFC
3448 - previously it was not set to max(4*R, 2*s/X) as specified. Using
the maximum of 1 second as upper bound (as it was done before) can have
detrimental effects, especially if R is small.
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This is in response to a request sent earlier by Eric W. Biederman
and replaces all sysctl numbers for net.dccp.default with CTL_UNNUMBERED.
It has been tested to compile and to work.
Commiter note: I've removed the use of CTL_UNNUMBERED, not setting .ctl_name
sets it to 0, that is the what CTL_UNNUMBERED is, reason is
to avoid unneeded source code cluttering.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This patch
* removes setting t_RTO in ccid3_hc_tx_init (per [RFC 3448, 4.2], t_RTO is
undefined until feedback has been received);
* makes some trivial changes (updates of comments);
* performs a small optimisation by exploiting that the feedback timeout
uses the value of t_ipi. The way it is done is safe, because the timeouts
appear after the changes to t_ipi, ensuring that up-to-date values are used;
* in ccid3_hc_tx_packet_recv, moves the t_rto statement closer to the calculation
of the next_tmout. This makes the code clearer to read and is also safe, since
t_rto is not updated until the next call of ccid3_hc_tx_packet_recv, and is not
read by the functions called via ccid_wait_for_ccid();
* removes a `max' statement in sk_reset_timer, this is not needed since the timeout
value is always greater than 1E6 microseconds.
* adds `XXX'es to highlight that currently the nofeedback timer is set
in a non-standard way
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This patch:
* consolidates updating of parameters (t_nom, t_ipi, t_delta) which
need to be updated at the same time, since they are inter-dependent
* removes two inline functions which are no longer needed as a result of
the above consolidation
* resolves a FIXME regarding the re-calculation of t_ipi within the nofeedback
timer, in the state where no feedback has previously been received
* ties updating these parameters to updating the sending rate X, exploiting
that all three parameters in turn depend on X; and using a small optimisation
which can reduce the number of required instructions: only update the three
parameters when X really changes
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This patch concerns updating the value of the nofeedback timer when no feedback
has been received so far.
Since in this case the value of R is still undefined according to [RFC 3448,
4.2], we can not perform step (3) of [RFC 3448, 4.3]. A clarification is
provided in [RFC 4342, sec. 5], which states that in these cases the nofeedback
timer (still) expires "after two seconds".
Many thanks to Ian McDonald for pointing this out and providing the
clarification.
The patch
* implements [RFC 4342, sec. 5] with regard to the above case
* consolidates handling timer restart by
- adding an appropriate jump label and
- initialising the timeout value
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This considers the case - ACK received while no packet has been sent
so far. Resolved by printing a (rate-limited) warning message.
Further removes an unnecessary BUG_ON in ccid3_hc_tx_packet_recv,
received feedback on a terminating connection is simply ignored.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This patch removes a switch statement which is redundant since,
* nothing is done in states TFRC_SSTATE_NO_SENT/TFRC_SSTATE_NO_FBACK
* it is impossible that the function is called in the state TFRC_SSTATE_TERM, since
--the function is called, in dccp_write_xmit, after ccid3_hc_tx_send_packet
--if ccid3_hc_tx_send_packet is called in state TFRC_SSTATE_TERM, it returns
-EINVAL, which means that ccid3_hc_tx_packet_sent will not be called
(compare dccp_write_xmit)
--> therefore, this case is logically impossible
* the remaining state is TFRC_SSTATE_FBACK which conditionally updates t_ipi, t_nom,
and t_delta. This is a no-op, since
--t_ipi only changes when feedback is received
--however, when feedback arrives via ccid3_hc_tx_packet_recv, there is an identical
code block which performs the same set of operations
--performing the same set of operations again in ccid3_hc_tx_packet_sent therefore
does not change anything, since between the time of receiving the last feedback
(and therefore update of t_ipi, t_nom, and t_delta), the value of t_ipi has not
changed
--since t_ipi has not changed, the values of t_delta and t_nom also do not change,
they depend fully on t_ipi
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This resolves an `XXX' in ccid3_hc_tx_send_packet().
The function is only called on Data and DataAck packets and returns a negative
result on zero-sized messages. This is a reasonable policy since CCID 3 is a
congestion-control module and congestion control on zero-sized Data(Ack)
packets is in a way pathological.
The patch uses a more suitable error code for this case, it returns the Posix.1
code `EBADMSG' ("Not a data message") instead of `ENOTCONN'.
As a result of ignoring zero-sized packets, a the condition for a warning
"First packet is data" in ccid3_hc_tx_packet_sent is always satisfied; this
message has been removed since it will always be printed.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This makes some logically equivalent simplifications, by replacing
rc - values plus goto's with direct return statements.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This patch performs a simplifying (performance) optimisation:
In each call of the inline function ccid3_calc_new_t_ipi(), the state is
tested against TFRC_SSTATE_NO_FBACK. This is expensive when the function
is called very often. A simpler solution, implemented by this patch, is
to adapt the control flow.
Background:
Now that we can stuff bigger ack vectors into options.
Signed-off-by: Andrea Bittau <a.bittau@cs.ucl.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Ack vectors grow proportional to the window size. If an ack vector does not fit
into a single option, it must be spread across multiple options. This patch
will allow for windows to grow larger.
Committer note: Simplified the patch a bit, original algorithm kept.
Signed-off-by: Andrea Bittau <a.bittau@cs.ucl.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Commiter note:
This was split from Andrea's original patch, in the process I changed the type
of the ackvec index fields to u16 instead of to int and haven't folded
dccp_ackvec_parse with dccp_ackvec_check_rcv_ackno.
Next patch will actually do the insertion of more than one ackvec per packet,
using, initially, up to a max of 2 ackvecs as per Andrea's original patch, then
I'll work on support for larger ackvecs, be it using a sysctl or using
setsockopt.
Signed-off-by: Andrea Bittau <a.bittau@cs.ucl.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Commiter note: original patch was splitted.
Signed-off-by: Andrea Bittau <a.bittau@cs.ucl.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This one got lost on the way from Ian to Gerrit to me, fix it.
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This removes a non-referenced variable.
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This makes the code of the dccp_probe module more portable.
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This adds documentation to the CCID 3 rx/tx socket fields, plus some
minor re-formatting.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This reaps the benefit of the earlier patch, which changed the type of
CCID 3 states to use enums, in that many conditions are now simplified
and the number of possible (unexpected) values is greatly reduced.
In a few instances, this also allowed to simplify pre-conditions; where
care has been taken to retain logical equivalence.
[DCCP]: Introduce a consistent BUG/WARN message scheme
This refines the existing set of DCCP messages so that
* BUG(), BUG_ON(), WARN_ON() have meaningful DCCP-specific counterparts
* DCCP_CRIT (for severe warnings) is not rate-limited
* DCCP_WARN() is introduced as rate-limited wrapper
Using these allows a faster and cleaner transition to their original
counterparts once the code has matured into a full DCCP implementation.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Previously the transmit queue was unbounded.
This patch:
* puts a limit on transmit queue length
and sends back EAGAIN if the buffer is full
* sets the TX queue length to a sensible default
* implements tx buffer sysctls for DCCP
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This adds a CCID3 debug option to the configuration menu
which is missing in Kconfig, but already used by the code.
CCID 2 already provides such an entry.
To enable debugging, set CONFIG_IP_DCCP_CCID3_DEBUG=y
NOTE: The use of ccid3_{t,r}x_state_name is safe, since
now only enum values can appear.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This patch
* makes debugging (when configured) work both for static / module build
* provides generic debugging macros for use in other DCCP / CCID modules
* adds missing information about debug parameters to Kconfig
* performs some code tidy-up
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
These are code optimizations which are relevant when dealing with large
windows. They are not coded the way I would like to, but they do the job for
the short-term. This patch should be more neat.
Commiter note: Changed the seqno comparisions to use {after,before}48 to handle
wrapping.
Signed-off-by: Andrea Bittau <a.bittau@cs.ucl.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Spotted by Ian McDonald, tentatively fixed by Gerrit Renker:
http://www.mail-archive.com/dccp%40vger.kernel.org/msg00599.html
Rewritten not to unroll sk_receive_skb, in the common case, i.e. no lock
debugging, its optimized away.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This patch tackles the following problem:
* the ccid3_hc_{t,r}x_sock define ccid3hc{t,r}x_state as `u8', but
in reality there can only be a few, pre-defined enum names
* this necessitates addiditional checking for unexpected values
which would otherwise be caught by the compiler
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Do not traverse the list of ack vector records [proportional to window size]
when we know we will not find what we are looking for. This is especially
useful because ack vectors are checked twice:
1) Upon parsing of options.
2) Upon notification of a new ack.
All of the work will occur during check #1. Therefore, when check #2 is
performed, no new work will be done. This is now "detected" and there is no
performance hit when doing #2.
Signed-off-by: Andrea Bittau <a.bittau@cs.ucl.ac.uk>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This patch does not change code; it performs some trivial clean/tidy-ups:
* removal of a `debug_prefix' string in favour of the
already existing dccp_role(sk)
* add documentation of structures and constants
* separated out the cases for invalid packets (step 1
of the packet validation)
* removing duplicate statements
* combining declaration & initialisation
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This patch replaces cryptic feature negotiation messages of type
Oct 31 15:42:20 kernel: dccp_feat_change: feat change type=32 feat=1
Oct 31 15:42:21 kernel: dccp_feat_change: feat change type=34 feat=1
Oct 31 15:42:21 kernel: dccp_feat_change: feat change type=32 feat=5
into ones of type:
Nov 2 13:54:45 kernel: dccp_feat_change: ChangeL(CCID (1), 3)
Nov 2 13:54:45 kernel: dccp_feat_change: ChangeR(CCID (1), 3)
Nov 2 13:54:45 kernel: dccp_feat_change: ChangeL(Ack Ratio (5), 2)
Also,
* completed the feature number list wrt RFC 4340 sec. 6.4
* annotating which ones have been implemented so far
* implemented rudimentary sanity checking in feat.c (FIXMEs)
* some minor fixes
Commiter note: uninlined dccp_feat_name and dccp_feat_typename, for
consistency with dccp_{state,packet}_name, that, BTW,
should be compiled only if CONFIG_IP_DCCP_DEBUG is
selected, leaving this to another cset tho. Also
shortened dccp_feat_negotiation_debug to dccp_feat_debug.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Resolves the problem that if IPv6 was configured `y' and DCCP `m' then
dccp_ipv6 was not built as a module. With this change, dccp_ipv6 is built
as a module whenever DCCP *OR* IPv6 are configured as modules; it will be
built-in only if both DCCP = `y' and IPV6 = `y'.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Throughout the TCP/DCCP (and tunnelling) code, it often happens that the
return code of a transmit function needs to be tested against NET_XMIT_CN
which is a value that does not indicate a strict error condition.
This patch uses a macro for these recurring situations which is consistent
with the already existing macro net_xmit_errno, saving on duplicated code.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This
* resolves a FIXME - DCCPv6 connections started all with
an initial sequence number of 1;
* provides a redirection `secure_dccpv6_sequence_number'
in case the init_sequence_v6 code should be updated later;
* concentrates the update of S.GAR into dccp_connect_init();
* removes a duplicate dccp_update_gss() in ipv4.c;
* uses inet->dport instead of usin->sin_port, due to the
following assignment in dccp_v4_connect():
inet->dport = usin->sin_port;
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This patch removes the following redundancies:
1) The test skb->protocol == htons(ETH_P_IPV6) in dccp_v6_init_sequence
is always true since
* dccp_v6_conn_request() is the only calling function
* dccp_v6_conn_request() redirects all skb's with ETH_P_IP to
dccp_v4_conn_request()
2) The first argument, `struct sock *sk', of dccp_v{4,6}_init_sequence()
is never used.
(This is similar for tcp_v{4,6}_init_sequence, an analogous patch has been
submitted to netdev and merged.)
By the way - are the `sport' / `dport' arguments in the right order?
I have made them consistent among calls but they seem to be in the
reverse order.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This removes 3 forward declarations by reordering 2 functions.
No code change at all.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
In order to make their function clearer and obtain a consistent naming
scheme to identify sysctls, all existing DCCP sysctls have been prefixed
with `sysctl_dccp', following the same convention as used by TCP.
Feature-specific sysctls retain the `feat' in the middle, although the
`default' has been dropped, since it is obvious from use.
Also removed a duplicate `dccp_feat_default_sequence_window' in ipv4.c.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This adds 3 sysctls which govern the retransmission behaviour of DCCP control
packets (3way handshake, feature negotiation).
It removes 4 FIXMEs from the code.
The close resemblance of sysctl variables to their TCP analogues is emphasised
not only by their name, but also by giving them the same initial values.
This is useful since there is not much practical experience with DCCP yet.
Furthermore, with regard to the previous patch, it is now possible to limit
the number of keepalive-Responses by setting net.dccp.default.request_retries
(also a bit like in TCP).
Lastly, added documentation of all existing DCCP sysctls.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This updates program documentation: spell out precise conditions about
which packets are eligible for retransmission (which is actually quite
hard to extract from RFC 4340).
It is based on the following table derived from RFC 4340:
+-----------+---------------------------------+---------------------+
| Type | Retransmit? | Remark |
+-----------+---------------------------------+---------------------+
| Request | in client-REQUEST state | sec. 8.1.1 |
| Response | NEVER | SHOULD NOT, 8.1.3 |
| Data | NEVER | unreliable protocol |
| Ack | possible in client-PARTOPEN | sec. 8.1.5 |
| DataAck | NEVER | unreliable protocol |
| CloseReq | only in server-CLOSEREQ state | MUST, sec. 8.3 |
| Close | in node-CLOSING state | MUST, sec. 8.3 |
+-----------+-------------------------------------------------------+
| Reset | only in response to other packets |
| Sync | only in response to sequence-invalid packets (7.5.4) |
| SyncAck | only in response to Sync packets |
+-----------+-------------------------------------------------------+
Hence the only packets eligible for retransmission are:
* Requests in client-REQUEST state (sec. 8.1.1)
* Acks in client-PARTOPEN state (sec. 8.1.5)
* CloseReq in server-CLOSEREQ state (sec. 8.3)
* Close in node-CLOSING state (sec. 8.3)
I had meant to put in a check for these types too, but have left that
for later.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This patch does the following:
a) introduces variable-length checksums as specified in [RFC 4340, sec. 9.2]
b) provides necessary socket options and documentation as to how to use them
c) basic support and infrastructure for the Minimum Checksum Coverage feature
[RFC 4340, sec. 9.2.1]: acceptability tests, user notification and user
interface
In addition, it
(1) fixes two bugs in the DCCPv4 checksum computation:
* pseudo-header used checksum_len instead of skb->len
* incorrect checksum coverage calculation based on dccph_x
(2) removes dccp_v4_verify_checksum() since it reduplicates code of the
checksum computation; code calling this function is updated accordingly.
(3) now uses skb_checksum(), which is safer than checksum_partial() if the
sk_buff has is a non-linear buffer (has pages attached to it).
(4) fixes an outstanding TODO item:
* If P.CsCov is too large for the packet size, drop packet and return.
The code has been tested with applications, the latest version of tcpdump now
comes with support for partial DCCP checksums.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Sorts out the comments for processing steps 2,3 in section 8.5 of RFC 4340.
All comments have been updated against this document, and the reference to step
2 has been made consistent throughout the files.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This patch fixes data being spewed into the logs continually. As the
code stood if there was a large queue and long delays timeo would go
down to zero and never get reset.
This fixes it by resetting timeo. Put constant into header as well.
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Fixes a typo in Kconfig, patch is by Ian McDonald and is re-sent from
http://www.mail-archive.com/dccp@vger.kernel.org/msg00579.html
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This does the same for ipv6.c as the preceding one does for ipv4.c: Only the
inet_connection_sock_af_ops forward declarations remain, since at least
dccp_ipv6_mapped has a circular dependency to dccp_v6_request_recv_sock.
No code change, merely re-ordering.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This patch removes two functions, the send_ack functions of request_sock,
which are not called/used by the DCCP code. It is correct that these
functions are not called, below is a justification why calling these
functions (on a passive socket in the LISTEN/RESPOND state) would mean
a DCCP protocol violation.
A) Background: using request_sock in TCP:
Gerrit Renker noticed dccp_tw_deschedule and submitted a patch with a FIXME,
but as he suggests in the same patch the best thing is to just ditch this
declaration, while doing that also noticed that tcp_tw_count is as well not
defined anywhere, so ditch it too.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This is a code simplification and was singled out from the
DCCPv6 Oops patch on
http://www.mail-archive.com/dccp@vger.kernel.org/msg00600.html
It mainly makes the code consistent between ipv{4,6}.c for the functions
dccp_v4_rcv
dccp_v6_rcv
and removes the do_time_wait label to simplify code somewhat.
Commiter note: fixed up a compile problem, trivial.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This is a code simplification:
it combines three often recurring operations into one inline function,
* allocate `len' bytes header space in skb
* fill these `len' bytes with zeroes
* cast the start of this header space as dccp_hdr
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This is a re-send from
http://www.mail-archive.com/dccp@vger.kernel.org/msg00553.html
It is the same patch as before, but I have built in Arnaldo's suggestions
pointed out in that posting.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
The data itself is already charged to the SKB, doing
the skb_set_owner_w() just generates a lot of noise and
extra atomics we don't really need.
Lmbench improvements on lat_tcp are minimal:
before:
TCP latency using localhost: 23.2701 microseconds
TCP latency using localhost: 23.1994 microseconds
TCP latency using localhost: 23.2257 microseconds
after:
TCP latency using localhost: 22.8380 microseconds
TCP latency using localhost: 22.9465 microseconds
TCP latency using localhost: 22.8462 microseconds
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently allocate a fixed size (TCP_SYNQ_HSIZE=512) slots hash table for
each LISTEN socket, regardless of various parameters (listen backlog for
example)
On x86_64, this means order-1 allocations (might fail), even for 'small'
sockets, expecting few connections. On the contrary, a huge server wanting a
backlog of 50000 is slowed down a bit because of this fixed limit.
This patch makes the sizing of listen hash table a dynamic parameter,
depending of :
- net.core.somaxconn tunable (default is 128)
- net.ipv4.tcp_max_syn_backlog tunable (default : 256, 1024 or 128)
- backlog value given by user application (2nd parameter of listen())
For large allocations (bigger than PAGE_SIZE), we use vmalloc() instead of
kmalloc().
We still limit memory allocation with the two existing tunables (somaxconn &
tcp_max_syn_backlog). So for standard setups, this patch actually reduce RAM
usage.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The return value of kfifo_alloc() should be checked by IS_ERR().
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP and RAW do not have this issue. Closes Bug #7432.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix printk format warnings:
build2.out:net/dccp/ccids/ccid2.c:355: warning: long long unsigned int format, u64 arg (arg 3)
build2.out:net/dccp/ccids/ccid2.c:360: warning: long long unsigned int format, u64 arg (arg 3)
build2.out:net/dccp/ccids/ccid2.c:482: warning: long long unsigned int format, u64 arg (arg 5)
build2.out:net/dccp/ccids/ccid2.c:639: warning: long long unsigned int format, u64 arg (arg 3)
build2.out:net/dccp/ccids/ccid2.c:639: warning: long long unsigned int format, u64 arg (arg 4)
build2.out:net/dccp/ccids/ccid2.c:674: warning: long long unsigned int format, u64 arg (arg 3)
build2.out:net/dccp/ccids/ccid2.c:720: warning: long long unsigned int format, u64 arg (arg 3)
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Updates the references to spec documents throughout the code, taking into
account that
* the DCCP, CCID 2, and CCID 3 drafts all became RFCs in March this year
* RFC 1063 was obsoleted by RFC 1191
* draft-ietf-tcpimpl-pmtud-0x.txt was published as an Informational
RFC, RFC 2923 on 2000-09-22.
All references verified.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based upon a patch from Jesper Juhl. Try to match the
TCP IPv6 code this was copied from as much as possible,
so that it's easy to see where to add the ipv6 pktoptions
support code.
Signed-off-by: David S. Miller <davem@davemloft.net>
I think I got the cause for the Oops observed in
http://www.mail-archive.com/dccp@vger.kernel.org/msg00578.html
The problem is always with applications listening on PF_INET6 sockets. Apart
from the mentioned oops, I observed another one one, triggered at irregular
intervals via timer interrupt:
run_timer_softirq -> dccp_keepalive_timer
-> inet_csk_reqsk_queue_prune
-> reqsk_free
-> dccp_v6_reqsk_destructor
The latter function is the problem and is also the last function to be called
in said kernel panic.
In any case, there is a real problem with allocating the right request_sock
which is what this patch tackles.
It fixes the following problem:
- application listens on PF_INET6
- DCCPv4 packet comes in, is handed over to dccp_v4_do_rcv, from there
to dccp_v4_conn_request
Now: socket is PF_INET6, packet is IPv4. The following code then furnishes the
connection with IPv6 - request_sock operations:
req = reqsk_alloc(sk->sk_prot->rsk_prot);
The first problem is that all further incoming packets will get a Reset since
the connection can not be looked up.
The second problem is worse:
--> reqsk_alloc is called instead of inet6_reqsk_alloc
--> consequently inet6_rsk_offset is never set (dangling pointer)
--> the request_sock_ops are nevertheless still dccp6_request_ops
--> destructor is called via reqsk_free
--> dccp_v6_reqsk_destructor tries to free random memory location (inet6_rsk_offset not set)
--> panic
I have tested this for a while, DCCP sockets are now handled correctly in all
three scenarios (v4/v6 only/v4-mapped).
Commiter note: I've added the dccp_request_sock_ops forward declaration to keep
the tree building and to reduce the size of the patch for 2.6.19,
later I'll move the functions to the top of the affected source
code to match what we have in the TCP counterpart, where this
problem hasn't existed in the first place, dumb me not to have
done the same thing on DCCP land 8)
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
annotated address arguments (port number left alone for now); ditto
for inferred net-endian variables in callers.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds DCCP probing shamelessly ripped off from TCP probes by Stephen
Hemminger.
I've put in here support for further CCID3 variables as well.
Andrea/Arnaldo might look to extend for CCID2.
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
With constants for CCID numbers this now uses them in some places.
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This has been discussed on dccp@vger and removes the necessity for applications
to supply service codes in each and every case.
If an application does not want to provide a service code, that's fine, it will
be given 0. Otherwise, service codes can be set via socket options as before.
This patch has been tested using various client/server configurations
(including listening on multiple service codes).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Introduce methods which manipulate interesting congestion control
state such as pipe and rtt estimate. This is useful for people
wishing to monitor the variables of CCID and instrument the code
[perhaps using Kprobes]. Personally, I am a fan of
encapsulation---that justifies this change =D.
Signed-off-by: Andrea Bittau <a.bittau@cs.ucl.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When multiple losses occur in one RTT, the window should be halved
only once [a single "congestion event"]. This is now implemented,
although not perfectly. Slightly changed the interface for changing
the cwnd: pass hctx instead of dp. This is required in order to allow
for change_cwnd to be called from _init().
Signed-off-by: Andrea Bittau <a.bittau@cs.ucl.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allocate more sequence state on demand. Each time a packet is sent
out by CCID2, a record of it needs to be kept. This list of records
grows proportionally to cwnd. Previously, the length of this list was
hardcored and therefore the cwnd could only grow to this value (of
128). Now, records are allocated on demand as necessary---cwnd may
grow as it wishes. The exceptional case of when memory is not
available is not handled gracefully. Perhaps, cwnd should be capped
at that point.
Signed-off-by: Andrea Bittau <a.bittau@cs.ucl.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow the user to choose whether or not to enable CCID2 debugging via
Kconfig.
Signed-off-by: Andrea Bittau <a.bittau@cs.ucl.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If not enough cwnd is available, tell the sender to check again as
soon as possible. This will increase CPU utilization (polling
frequently for cwnd) but will improve network performance. That is,
the sender will need to wait less before detecting the increase of
cwnd. A better architecture would be for the CCID to call-back (or
dequeue) from DCCP when it is able to transmit traffic -- not the
other way around as it currently occurs.
Signed-off-by: Andrea Bittau <a.bittau@cs.ucl.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Initialize the slow-start threshold to infinity. This way, upon connection
initiation, slow-start will be exited only upon a packet loss. This patch will
allow connections to quickly gain speed.
Signed-off-by: Andrea Bittau <a.bittau@cs.ucl.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiffies are now handled correctly (I hope) in CCID2. If they wrap, no
problem.
Signed-off-by: Andrea Bittau <a.bittau@cs.ucl.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Get rid of unused variables in ackvector state.
Signed-off-by: Andrea Bittau <a.bittau@cs.ucl.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix the way state is masked out. DCCP_ACKVEC_STATE_NOT_RECEIVED is
defined as appears in the packet, therefore bit shifting is not
required. This fix allows CCID2 to correctly detect losses.
Signed-off-by: Andrea Bittau <a.bittau@cs.ucl.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix ackvector length calculation upon receiving an "ack-of-ack". This
patch avoids the ackvector from growing too large which causes it to
not be inserted into packets.
Signed-off-by: Andrea Bittau <a.bittau@cs.ucl.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Function sk_filter() is called from tcp_v{4,6}_rcv() functions with arg
needlock = 0, while socket is not locked at that moment. In order to avoid
this and similar issues in the future, use rcu for sk->sk_filter field read
protection.
Signed-off-by: Dmitry Mishin <dim@openvz.org>
Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
As Arnaldo Carvalho de Melo points out I should be using list_entry in case
the structure changes in future. Current code functions but is reliant
on position and requires type cast.
Noticed when doing this that I have one more variable than I needed so
removing that also.
Signed off by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds transmit buffering to DCCP.
I have tested with CCID2/3 and with loss and rate limiting.
Signed off by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
This shifts further sysctls into feat.h. No change in
functionality - shifting code only.
Signed off by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based on MIPL2 kernel patch.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Ville Nuorvala <vnuorval@tcs.hut.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Right now most inet_lookup_* functions take a host-order hnum instead
of a network-order dport because that's how it is represented
internally.
This means that users of these functions have to be careful about
using the right byte-order. To add more confusion, inet_lookup takes
a network-order dport unlike all other functions.
So this patch changes all visible inet_lookup functions to take a
dport and move all dport->hnum conversion inside them.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This automatically labels the TCP, Unix stream, and dccp child sockets
as well as openreqs to be at the same MLS level as the peer. This will
result in the selection of appropriately labeled IPSec Security
Associations.
This also uses the sock's sid (as opposed to the isec sid) in SELinux
enforcement of secmark in rcv_skb and postroute_last hooks.
Signed-off-by: Venkat Yekkirala <vyekkirala@TrustedCS.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This labels the flows that could utilize IPSec xfrms at the points the
flows are defined so that IPSec policy and SAs at the right label can
be used.
The following protos are currently not handled, but they should
continue to be able to use single-labeled IPSec like they currently
do.
ipmr
ip_gre
ipip
igmp
sit
sctp
ip6_tunnel (IPv6 over IPv6 tunnel device)
decnet
Signed-off-by: Venkat Yekkirala <vyekkirala@TrustedCS.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes CCID3 to give much closer performance to RFC4342.
CCID3 is meant to alter sending rate based on RTT and loss.
The performance was verified against:
http://wand.net.nz/~perry/max_download.php
For example I tested with netem and had the following parameters:
Delayed Acks 1, MSS 256 bytes, RTT 105 ms, packet loss 5%.
This gives a theoretical speed of 71.9 Kbits/s. I measured across three
runs with this patch set and got 70.1 Kbits/s. Without this patchset the
average was 232 Kbits/s which means Linux can't be used for CCID3 research
properly.
I also tested with netem turned off so box just acting as router with 1.2
msec RTT. The performance with this is the same with or without the patch
at around 30 Mbit/s.
Signed off by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds a new function dccp_rx_hist_find_entry.
Signed off by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds a new function to see if two sequence numbers follow each
other.
Signed off by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes a small typo in net/dccp/libs/packet_history.c
Signed off by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current users of ip6_dst_lookup can be divided into two classes:
1) The caller holds no locks and is in user-context (UDP).
2) The caller does not want to lookup the dst cache at all.
The second class covers everyone except UDP because most people do
the cache lookup directly before calling ip6_dst_lookup. This patch
adds ip6_sk_dst_lookup for the first class.
Similarly ip6_dst_store users can be divded into those that need to
take the socket dst lock and those that don't. This patch adds
__ip6_dst_store for those (everyone except UDP/datagram) that don't
need an extra lock.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using the default sequence window size (100) I got the following in
my logs:
Jun 22 14:24:09 localhost kernel: [ 1492.114775] DCCP: Step 6 failed for
DATA packet, (LSWL(6279674225) <= P.seqno(6279674749) <=
S.SWH(6279674324)) and (P.ackno doesn't exist or LAWL(18798206530) <=
P.ackno(1125899906842620) <= S.AWH(18798206548), sending SYNC...
Jun 22 14:24:09 localhost kernel: [ 1492.115147] DCCP: Step 6 failed for
DATA packet, (LSWL(6279674225) <= P.seqno(6279674750) <=
S.SWH(6279674324)) and (P.ackno doesn't exist or LAWL(18798206530) <=
P.ackno(1125899906842620) <= S.AWH(18798206549), sending SYNC...
I went to alter the default sysctl and it didn't take for new sockets.
Below patch fixes this.
I think the default is too low but it is what the DCCP spec specifies.
As a side effect of this my rx speed using iperf goes from about 2.8 Mbits/sec
to 3.5. This is still far too slow but it is a step in the right direction.
Compile tested only for IPv6 but not particularly complex change.
Signed off by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No actual bugs that I can see just a couple of unmarked casts
getting annoying in my debug log files.
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Default values for boolean and tristate options can only be 'y', 'm' or 'n'.
This patch removes wrong default for IP_DCCP_ACKVEC.
Signed-off-by: Jean-Luc Leger <jean-luc.leger@dspnet.fr.eu.org>
Cc: Arnaldo Carvalho de Melo <acme@conectiva.com.br>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add an extra argument to sk_eat_skb, and make it move early copied
packets to the async_wait_queue instead of freeing them.
Signed-off-by: Chris Leech <christopher.leech@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A soft lockup existed in the handling of ack vector records.
Specifically, when a tail of the list of ack vector records was
removed, it was possible to end up iterating infinitely on an element
of the tail.
Signed-off-by: Andrea Bittau <a.bittau@cs.ucl.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Calling sock_orphan inside bh_lock_sock in dccp_close can lead to dead
locks. For example, the inet_diag code holds sk_callback_lock without
disabling BH. If an inbound packet arrives during that admittedly tiny
window, it will cause a dead lock on bh_lock_sock. Another possible
path would be through sock_wfree if the network device driver frees the
tx skb in process context with BH enabled.
We can fix this by moving sock_orphan out of bh_lock_sock.
The tricky bit is to work out when we need to destroy the socket
ourselves and when it has already been destroyed by someone else.
By moving sock_orphan before the release_sock we can solve this
problem. This is because as long as we own the socket lock its
state cannot change.
So we simply record the socket state before the release_sock
and then check the state again after we regain the socket lock.
If the socket state has transitioned to DCCP_CLOSED in the time being,
we know that the socket has been destroyed. Otherwise the socket is
still ours to keep.
This problem was discoverd by Ingo Molnar using his lock validator.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
we dont free req if we cant parse the options.
This fixes coverity bug id #1046
Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
From: Randy Dunlap <rdunlap@xenotime.net>
Use NULL instead of 0 for pointers.
Fix these sparse warnings:
net/dccp/feat.c:207:20: warning: Using plain integer as NULL pointer
net/dccp/feat.c:325:21: warning: Using plain integer as NULL pointer
net/dccp/feat.c:526:20: warning: Using plain integer as NULL pointer
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement the half-closed devices notifiation, by adding a new POLLRDHUP
(and its alias EPOLLRDHUP) bit to the existing poll/select sets. Since the
existing POLLHUP handling, that does not report correctly half-closed
devices, was feared to be changed, this implementation leaves the current
POLLHUP reporting unchanged and simply add a new bit that is set in the few
places where it makes sense. The same thing was discussed and conceptually
agreed quite some time ago:
http://lkml.org/lkml/2003/7/12/116
Since this new event bit is added to the existing Linux poll infrastruture,
even the existing poll/select system calls will be able to use it. As far
as the existing POLLHUP handling, the patch leaves it as is. The
pollrdhup-2.6.16.rc5-0.10.diff defines the POLLRDHUP for all the existing
archs and sets the bit in the six relevant files. The other attached diff
is the simple change required to sys/epoll.h to add the EPOLLRDHUP
definition.
There is "a stupid program" to test POLLRDHUP delivery here:
http://www.xmailserver.org/pollrdhup-test.c
It tests poll(2), but since the delivery is same epoll(2) will work equally.
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is in preparation for having a dccp_minisock embedded into
dccp_request_sock so that feature negotiation can be done prior to
creating the full blown dccp_sock.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This will later be included in struct dccp_request_sock so that we can
have per connection feature negotiation state while in the 3way
handshake, when we clone the DCCP_ROLE_LISTEN socket (in
dccp_create_openreq_child) we'll just copy this state from
dreq_minisock to dccps_minisock.
Also the feature negotiation and option parsing code will mostly touch
dccps_minisock, which will simplify some stuff.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No code changes, just tidying up, in some cases moving EXPORT_SYMBOLs
to just after the function exported, etc.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch extends {get|set}sockopt compatibility layer in order to
move protocol specific parts to their place and avoid huge universal
net/compat.c file in the future.
Signed-off-by: Dmitry Mishin <dim@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
And not the silly LIMIT_NETDEBUG and silently return without inserting
the option requested.
Also drop some old debugging messages associated to option insertion.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merging it with its only user: dccp_v[46]_reqsk_send_ack.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using this also provides opportunities for introducing
inet_csk_alloc_skb that would call alloc_skb, account it to the sock
and skb_reserve(max_header), but I'll leave this for later, for now
using sk_prot->max_header consistently is enough.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I.e. they should be just ignored, but we have to use 'break', not 'continue',
as we have to possibly reset the mandatory flag.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to dccp draft (draft-ietf-dccp-spec-13.txt) section 5.8.2
(Mandatory Option) the following patch correct the handling of the
following cases:
1) "... and any Mandatory options received on DCCP-Data packets MUST be
ignored."
2) "The connection is in error and should be reset with Reset Code 5, ...
if option O is absent (Mandatory was the last byte of the option list), or
if option O equals Mandatory."
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
No changes in the logic were made, just removing trailing whitespaces,
etc.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Consolidating open coded sequences in tcp and dccp, v4 and v6.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I guess I forgot to add it, nah, now it just works:
18:04:33.274066 IP6 ::1.1476 > ::1.5001: request (service=0)
18:04:33.334482 IP6 ::1.5001 > ::1.1476: reset (code=bad_service_code)
Ditched IP_DCCP_UNLOAD_HACK, as now we would have to do it for both
IPv6 and IPv4, so I'll come up with another way for freeing the
control sockets in upcoming changesets.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There's no reason for struct dccp_v4_prot being global.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With this patch in place we can break down the complexity by better
compartmentalizing the code that is common to ipv6 and ipv4.
Now we have these modules:
Module Size Used by
dccp_diag 1344 0
inet_diag 9448 1 dccp_diag
dccp_ccid3 15856 0
dccp_tfrc_lib 12320 1 dccp_ccid3
dccp_ccid2 5764 0
dccp_ipv4 16996 2
dccp 48208 4 dccp_diag,dccp_ccid3,dccp_ccid2,dccp_ipv4
dccp_ipv6 still requires dccp_ipv4 due to dccp_ipv6_mapped, that is
the next target to work on the "hey, ipv4 is legacy, I only want ipv6
dude!" direction.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
And introduce dccp_mib_exit grouping previously open coded sequence.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dccp_make_response is shared by ipv4/6 and the ipv6 code was
recalculating the checksum, not good, so move the dccp_v4_checksum
call to dccp_v4_send_response.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As this is used by both ipv4 and ipv6 and is not ipv4 specific.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Removing one more ipv6 uses ipv4 stuff case in dccp land.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Renaming it to dccp_send_reset and moving it from the ipv4 specific
code to the core dccp code.
This fixes some bugs in IPV6 where timers would send v4 resets, etc.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[root@qemu ~]# for a in /proc/sys/net/dccp/default/* ; do echo $a ; cat $a ; done
/proc/sys/net/dccp/default/ack_ratio
2
/proc/sys/net/dccp/default/rx_ccid
3
/proc/sys/net/dccp/default/send_ackvec
1
/proc/sys/net/dccp/default/send_ndp
1
/proc/sys/net/dccp/default/seq_window
100
/proc/sys/net/dccp/default/tx_ccid
3
[root@qemu ~]#
So if wanting to test ccid3 as the tx CCID one can just do:
[root@qemu ~]# echo 3 > /proc/sys/net/dccp/default/tx_ccid
[root@qemu ~]# echo 2 > /proc/sys/net/dccp/default/rx_ccid
[root@qemu ~]# cat /proc/sys/net/dccp/default/[tr]x_ccid
2
3
[root@qemu ~]#
Of course we also need the setsockopt for each app to tell its preferences, but
for testing or defining something other than CCID2 as the default for apps that
don't explicitely set their preference the sysctl interface is handy.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
So that dccp_feat_clean doesn't get confused with uninitialized
list_heads.
Noticed when testing with no ccid kernel modules.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make CCID2 and CCID3 default to what was selected for DCCP and use the
standard short description for the CCIDs (TCP-Like & TCP-Friendly).
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This also fixes the layout of dccp_hdr short sequence numbers, problem
was not fatal now as we only support long (48 bits) sequence numbers.
Signed-off-by: Andrea Bittau <a.bittau@cs.ucl.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change the CCID upon successful feature negotiation.
Commiter note: patch mostly rewritten to use the new ccid API.
Signed-off-by: Andrea Bittau <a.bittau@cs.ucl.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1. No need for ->ccid_init nor ->ccid_exit, this is what module_{init,exit}
does and anynways neither ccid2 nor ccid3 were using it.
2. Rename struct ccid to struct ccid_operations and introduce struct ccid
with a pointer to ccid_operations and rigth after it the rx or tx
private state.
3. Remove the pointer to the state of the half connections from struct
dccp_sock, now its derived thru ccid_priv() from the ccid pointer.
Now we also can implement the setsockopt for changing the CCID easily as
no ccid init routines can affect struct dccp_sock in any way that prevents
other CCIDs from working if a CCID switch operation is asked by apps.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There was a hybrid use of standard timers and sk_timers. This caused
the reference count of the sock to be incorrect when resetting the RTO
timer. The sock reference count should now be correct, enabling its
destruction, and allowing the DCCP module to be unloaded.
Signed-off-by: Andrea Bittau <a.bittau@cs.ucl.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Original work by Andrea Bittau, Arnaldo Melo cleaned up and fixed several
issues on the merge process.
For now CCID2 was turned the default for all SOCK_DCCP connections, but this
will be remedied soon with the merge of the feature negotiation code.
Signed-off-by: Andrea Bittau <a.bittau@cs.ucl.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Testing if the ccid being instantiated has these methods in
ccid_init().
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based on a patch by Andrea Bittau.
Signed-off-by: Andrea Bittau <a.bittau@cs.ucl.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simplifying the code a bit as we're always using DCCP_MAX_ACKVEC_LEN.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In rare circumstances 0 is returned by dccp_li_hist_calc_i_mean which
leads to a divide by zero in ccid3_hc_rx_packet_recv. Explicitly check
for zero return now. Update copyright notice at same time.
Found by Arnaldo.
Signed-off-by: Ian McDonald <imcdnzl@gmail.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A bunch of asm/bug.h includes are both not needed (since it will get
pulled anyway) and bogus (since they are done too early). Removed.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It was copy&pasted from tcp_v6_send_synack() which has
a DST leak recently fixed by Eric W. Biederman.
So dccp_v6_send_response() needs the same fix too.
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_route_newports uses the struct flowi from the struct rtable returned
by ip_route_connect for the new route lookup and just replaces the port
numbers if they have changed. If an IPsec policy exists which doesn't match
port 0 the struct flowi won't have the proto field set and no xfrm lookup
is done for the changed ports.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Handle NAT of decapsulated IPsec packets by reconstructing the struct flowi
of the original packet from the conntrack information for IPsec policy
checks.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Keep the conntrack reference until policy checks have been performed for
IPsec NAT support. The reference needs to be dropped before a packet is
queued to avoid having the conntrack module unloadable.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move nextheader offset to the IP6CB to make it possible to pass a
packet to ip6_input_finish multiple times and have it skip already
parsed headers. As a nice side effect this gets rid of the manual
hopopts skipping in ip6_input_finish.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The CCID should be notified of packet reception only when a packet is
valid. Therefore, the ACK vector needs to be processed before
notifying the CCID. Also, the CCID might need information provided by
the ACK vector.
Signed-off-by: Andrea Bittau <a.bittau@cs.ucl.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If ACK vectors are used, each packet with an ACK should contain an ACK
vector. The only exception currently is response packets. It
probably is not a good idea to store ACK vector state before the
connection is completed (to help protect from syn floods).
Signed-off-by: Andrea Bittau <a.bittau@cs.ucl.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When packets are received, the connection is either in DCCP_OPEN
[fast-path] or it isn't. If it's not [e.g. DCCP_PARTOPEN] upper
layers will perform sanity checks and parse options. If it is in
DCCP_OPEN, dccp_rcv_established() will do it. It is important not to
re-parse options in dccp_rcv_established() when it is not called from
the fast-path. Else, fore example, the ack vector will be added twice
and the CCID will see the packet twice.
The solution is to always enfore sanity checks from the upper layers.
When packets arrive in the fast-path, sanity checks will be performed
before calling dccp_rcv_established().
Note(acme): I rewrote the patch to achieve the same result but keeping
dccp_rcv_established with the previous semantics and having it split
into __dccp_rcv_established, that doesn't does do any sanity check,
code in state != DCCP_OPEN use this lighter version as they already do
the sanity checks.
Signed-off-by: Andrea Bittau <a.bittau@cs.ucl.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To help in reducing the number of include dependencies, several files were
touched as they were getting needed headers indirectly for stuff they use.
Thanks also to Alan Menegotto for pointing out that net/dccp/proto.c had
linux/dccp.h include twice.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Its common enough to to justify that, TCP still can't use it as it has the
prequeueing stuff, still to be made generic in the not so distant future :-)
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I noticed that some of 'struct proto_ops' used in the kernel may share
a cache line used by locks or other heavily modified data. (default
linker alignement is 32 bytes, and L1_CACHE_LINE is 64 or 128 at
least)
This patch makes sure a 'struct proto_ops' can be declared as const,
so that all cpus can share all parts of it without false sharing.
This is not mandatory : a driver can still use a read/write structure
if it needs to (and eventually a __read_mostly)
I made a global stubstitute to change all existing occurences to make
them const.
This should reduce the possibility of false sharing on SMP, and
speedup some socket system calls.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As DCCP needs to be called in the same spots.
Now we have a member in inet_sock (is_icsk), set at sock creation time from
struct inet_protosw->flags (if INET_PROTOSW_ICSK is set, like for TCP and
DCCP) to see if a struct sock instance is a inet_connection_sock for places
like the ones in ip_sockglue.c (v4 and v6) where we previously were looking if
sk_type was SOCK_STREAM, that is insufficient because we now use the same code
for DCCP, that has sk_type SOCK_DCCP.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Renaming it to inet6_hash_connect, making it possible to ditch
dccp_v6_hash_connect and share the same code with TCP instead.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Renaming it to inet_hash_connect, making it possible to ditch
dccp_v4_hash_connect and share the same code with TCP instead.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
So that we can share several timewait sockets related functions and
make the timewait mini sockets infrastructure closer to the request
mini sockets one.
Next changesets will take advantage of this, moving more code out of
TCP and DCCP v4 and v6 to common infrastructure.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now we have the destructor (dccp_v4_reqsk_destructor) in our
request_sock_ops vtable.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Still needs mucho polishing, specially in the checksum code, but works
just fine, inet_diag/iproute2 and all 8)
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Basically exports a similar set of functions as the one exported by
the non-AF specific TCP code.
In the process moved some non-AF specific code from dccp_v4_connect to
dccp_connect_init and moved the checksum verification from
dccp_invalid_packet to dccp_v4_rcv, so as to use it in dccp_v6_rcv
too.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
And make the core DCCP code AF agnostic, just like TCP, now its time
to work on net/dccp/ipv6.c, we are close to the end!
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I hope to actually change this behaviour shortly but this will help
anybody grepping code at present.
Signed-off-by: Ian McDonald <imcdnzl@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
From: Jesper Juhl <jesper.juhl@gmail.com>
This is the net/ part of the big kfree cleanup patch.
Remove pointless checks for NULL prior to calling kfree() in net/.
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Arnaldo Carvalho de Melo <acme@conectiva.com.br>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
This patch randomizes the port selected on bind() for connections
to help with possible security attacks. It should also be faster
in most cases because there is no need for a global lock.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Here is a complimentary insurance policy for those feeling a bit insecure.
You don't have to accept this. However, if you do, you can't blame me for
it :)
> 1) dccp_transmit_skb sets the owner for all packets except data packets.
We can actually verify this by looking at pkt_type.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
While we're at it let's reorganise the set_owner_w calls a little so that:
1) dccp_transmit_skb sets the owner for all packets except data packets.
2) Add dccp_skb_entail to set owner for packets queued for retransmission.
3) Make dccp_transmit_skb static.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Turns out the problem has nothing to do with use-after-free or double-free.
It's just that we're not clearing the CB area and DCCP unlike TCP uses a CB
format that's incompatible with IP.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Ian McDonald <imcdnzl@gmail.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
icmp_send doesn't use skb->sk at all so even if skb->sk has already
been freed it can't cause crash there (it would've crashed somewhere
else first, e.g., ip_queue_xmit).
I found a double-free on an skb that could explain this though.
dccp_sendmsg and dccp_write_xmit are a little confused as to what
should free the packet when something goes wrong. Sometimes they
both go for the ball and end up in each other's way.
This patch makes dccp_write_xmit always free the packet no matter
what. This makes sense since dccp_transmit_skb which in turn comes
from the fact that ip_queue_xmit always frees the packet.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
David S. Miller <davem@davemloft.net> wrote:
> One thing you can probably do for this bug is to mark data packets
> explicitly somehow, perhaps in the SKB control block DCCP already
> uses for other data. Put some boolean in there, set it true for
> data packets. Then change the test in dccp_transmit_skb() as
> appropriate to test the boolean flag instead of "skb_cloned(skb)".
I agree. In fact we already have that flag, it's called skb->sk.
So here is patch to test that instead of skb_cloned().
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Ian McDonald <imcdnzl@gmail.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Noticed by Andrea Bittau, that provided a patch that was modified to
not transition from RESPOND to OPEN when receiving DATA packets.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
For consistency with ccid_exit and to fix a bug when
IP_DCCP_UNLOAD_HACK is enabled as the control sock is not associated
to any CCID.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
- added typedef unsigned int __nocast gfp_t;
- replaced __nocast uses for gfp flags with gfp_t - it gives exactly
the same warnings as far as sparse is concerned, doesn't change
generated code (from gcc point of view we replaced unsigned int with
typedef) and documents what's going on far better.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Arnaldo and I agreed it could be applied now, because I have other
pending patches depending on this one (Thank you Arnaldo)
(The other important patch moves skc_refcnt in a separate cache line,
so that the SMP/NUMA performance doesnt suffer from cache line ping pongs)
1) First some performance data :
--------------------------------
tcp_v4_rcv() wastes a *lot* of time in __inet_lookup_established()
The most time critical code is :
sk_for_each(sk, node, &head->chain) {
if (INET_MATCH(sk, acookie, saddr, daddr, ports, dif))
goto hit; /* You sunk my battleship! */
}
The sk_for_each() does use prefetch() hints but only the begining of
"struct sock" is prefetched.
As INET_MATCH first comparison uses inet_sk(__sk)->daddr, wich is far
away from the begining of "struct sock", it has to bring into CPU
cache cold cache line. Each iteration has to use at least 2 cache
lines.
This can be problematic if some chains are very long.
2) The goal
-----------
The idea I had is to change things so that INET_MATCH() may return
FALSE in 99% of cases only using the data already in the CPU cache,
using one cache line per iteration.
3) Description of the patch
---------------------------
Adds a new 'unsigned int skc_hash' field in 'struct sock_common',
filling a 32 bits hole on 64 bits platform.
struct sock_common {
unsigned short skc_family;
volatile unsigned char skc_state;
unsigned char skc_reuse;
int skc_bound_dev_if;
struct hlist_node skc_node;
struct hlist_node skc_bind_node;
atomic_t skc_refcnt;
+ unsigned int skc_hash;
struct proto *skc_prot;
};
Store in this 32 bits field the full hash, not masked by (ehash_size -
1) Using this full hash as the first comparison done in INET_MATCH
permits us immediatly skip the element without touching a second cache
line in case of a miss.
Suppress the sk_hashent/tw_hashent fields since skc_hash (aliased to
sk_hash and tw_hash) already contains the slot number if we mask with
(ehash_size - 1)
File include/net/inet_hashtables.h
64 bits platforms :
#define INET_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\
(((__sk)->sk_hash == (__hash))
((*((__u64 *)&(inet_sk(__sk)->daddr)))== (__cookie)) && \
((*((__u32 *)&(inet_sk(__sk)->dport))) == (__ports)) && \
(!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))
32bits platforms:
#define TCP_IPV4_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\
(((__sk)->sk_hash == (__hash)) && \
(inet_sk(__sk)->daddr == (__saddr)) && \
(inet_sk(__sk)->rcv_saddr == (__daddr)) && \
(!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))
- Adds a prefetch(head->chain.first) in
__inet_lookup_established()/__tcp_v4_check_established() and
__inet6_lookup_established()/__tcp_v6_check_established() and
__dccp_v4_check_established() to bring into cache the first element of the
list, before the {read|write}_lock(&head->lock);
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allocation for the optnames is similar to the DCCP options, with a
range for rx and tx half connection CCIDs.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Moving the TFRC sender and receiver variables to separate structs, so
that we can copy these structs to userspace thru getsockopt,
dccp_diag, etc.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Isolating it, that will be used when we introduce a CCID2 (TCP-Like)
implementation.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As discussed in the dccp@vger mailing list:
Now applications have to use setsockopt(DCCP_SOCKOPT_SERVICE, service[s]),
prior to calling listen() and connect().
An array of unsigned ints can be passed meaning that the listening sock accepts
connection requests for several services.
With this we can ditch struct sockaddr_dccp and use only sockaddr_in (and
sockaddr_in6 in the future).
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Moving the setting of DCCP_SKB_CB(skb)->dccpd_reset_code to the places
where events happen that trigger sending a RESET packet.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eliciting a SYNCACK in response, we were handling SYNC packets
only in the DCCP_OPEN state, in dccp_rcv_established.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
It is possible to receive more than one CLOSEREQ packet if the
CLOSE packet sent in response is somehow lost, change the state
to DCCP_CLOSING only on the first CLOSEREQ packet received.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Also use some BUG_ON where appropriate and use LIMIT_NETDEBUG for the unlikely
cases where we, at this stage, want to know about, that in my tests hasn't
appeared in the radar.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
To match more closely what is described in RFC 3448.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: Ian McDonald <iam4@cs.waikato.ac.nz>
To start the timestamps with 0.0ms, easing the integer maths in the CCIDs, this
probably will be reworked to use the to be introduced struct timeval_offset
infrastructure out of skb_get_timestamp, etc.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
The initialization of ccid3hcrx_rtt to 5ms is just a bandaid, I'll continue
auditing the CCID3 HC rx codebase to fix this properly, probably I'll add a
feedback timer as suggested in the CCID3 draft.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
We can get this value in an TIMESTAMP_ECHO and/or in an ELAPSED_TIME option, if
receiving both give precendence to the biggest one.
In my tests they are very close if not equal at all times, so we may well think
about removing the code in CCID3 that inserts this option and leaving this to
the core, and perhaps even use just TIMESTAMP_ECHO including the elapsed time.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This makes the send rate calculations behave way more closely to what
is specified, with the jitter previously seen on x and x_recv
disappearing completely on non lossy setups.
This resembles the tcp_data_snd_check code, that possibly we'll end up
using in DCCP as well, perhaps moving this code to
inet_connection_sock.
For now I'm doing the simplest implementation tho.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
So that applications can set dccp_sock->dccps_pkt_size, that in turn
is used in the CCID3 half connection init routines to set
ccid3hc[tr]x_s and use it in its rate calculations.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Renaming it to dccp_rx_hist_detect_loss.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Renaming it to dccp_rx_hist_add_packet.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I'll now take a look at the other proposed TFRC DCCP CCIDs to find
more code that is now in ccid3.c and move to this module, the loss
event rate, calc_X, etc most probably will be moved there.
The main goal of these changes is to pave the way for the
implementation of more TFRC based DCCP CCIDs and to shrink ccid3.c,
reducing its complexity and helping in getting it rock solid.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
And put this into net/dccp/ccids/lib/, where packet_history.[ch] will also be
moved and then we'll have a tfrc_lib.ko module that will be used by
dccp_ccid3.ko and other CCIDs that are variations of TFRC (RFC 3448).
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To avoid open coding this all over the place.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introducing functions to add to or subtract from a timeval variable
and renaming now_delta to timeval_new_delta that calls do_gettimeofday
and then timeval_delta, that should be used when there are several
deltas made relative to the current time or setting variables to it,
so as to avoid calling do_gettimeofday excessively.
I'm leaving these "timeval_" prefixed funcions internal to DCCP for a
while till we're sure there are no subtle bugs in it.
It also is more correct as it checks if the number of usecs added to
or subtracted from a tv_usec field is more than 2 seconds.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is not quite what I think we should have long term but improves
performance for now, so lets use it till we get CCID3 working well,
then we can think about using sk_write_queue, perhaps using some ideas
from Juwen Lai's old stack for 2.4.20.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch puts mostly read only data in the right section
(read_mostly), to help sharing of these data between CPUS without
memory ping pongs.
On one of my production machine, tcp_statistics was sitting in a
heavily modified cache line, so *every* SNMP update had to force a
reload.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tested with a patched netcat, no horror stories so far 8)
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
And also hc_tx and hc_rx get_info functions for the CCIDs to fill in
information that is specific to them.
For now reusing struct tcp_info, later I'll try to figure out a better
solution, for now its really nice to get this kind of info:
[root@qemu ~]# ./ss -danemi
State Recv-Q Send-Q Local Addr:Port Peer Addr:Port
LISTEN 0 0 *:5001 *:* ino:628 sk:c1340040
mem:(r0,w0,f0,t0) cwnd:0 ssthresh:0
ESTAB 0 0 172.20.0.2:5001 172.20.0.1:32785 ino:629 sk:c13409a0
mem:(r0,w0,f0,t0) ts rto:1000 rtt:0.004/0 cwnd:0 ssthresh:0 rcv_rtt:61.377
This, for instance, shows that we're not congestion controlling ACKs,
as the above output is in the ttcp receiving host, and ttcp is a one
way app, i.e. the received never calls sendmsg, so
ccid_hc_tx_send_packet is never called, so the TX half connection
stays in TFRC_SSTATE_NO_SENT state and hctx_rtt is never calculated,
stays with the value set in ccid3_hc_tx_init, 4us, as show above in
milliseconds (0.004ms), upcoming patches will fix this.
rcv_rtt seems sane tho, matching ping results :-)
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using TIMESTAMP_ECHO and ELAPSED_TIME options received.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
And don't insert a TIMESTAMP option in all packets, leave the decision
to the CCIDs.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>