Inspection of upper layer protocol is considered harmful, especially
if it is about ARP or other stateful upper layer protocol; driver
cannot (and should not) have full state of them.
IPv4 over Firewire module used to inspect ARP (both in sending path
and in receiving path), and record peer's GUID, max packet size, max
speed and fifo address. This patch removes such inspection by extending
our "hardware address" definition to include other information as well:
max packet size, max speed and fifo. By doing this, The neighbour
module in networking subsystem can cache them.
Note: As we have started ignoring sspd and max_rec in ARP/NDP, those
information will not be used in the driver when sending.
When a packet is being sent, the IP layer fills our pseudo header with
the extended "hardware address", including GUID and fifo. The driver
can look-up node-id (the real but rather volatile low-level address)
by GUID, and then the module can send the packet to the wire using
parameters provided in the extendedn hardware address.
This approach is realistic because IP over IEEE1394 (RFC2734) and IPv6
over IEEE1394 (RFC3146) share same "hardware address" format
in their address resolution protocols.
Here, extended "hardware address" is defined as follows:
union fwnet_hwaddr {
u8 u[16];
struct {
__be64 uniq_id; /* EUI-64 */
u8 max_rec; /* max packet size */
u8 sspd; /* max speed */
__be16 fifo_hi; /* hi 16bits of FIFO addr */
__be32 fifo_lo; /* lo 32bits of FIFO addr */
} __packed uc;
};
Note that Hardware address is declared as union, so that we can map full
IP address into this, when implementing MCAP (Multicast Cannel Allocation
Protocol) for IPv6, but IP and ARP subsystem do not need to know this
format in detail.
One difference between original ARP (RFC826) and 1394 ARP (RFC2734)
is that 1394 ARP Request/Reply do not contain the target hardware address
field (aka ar$tha). This difference is handled in the ARP subsystem.
CC: Stephan Gatzka <stephan.gatzka@gmail.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull to get the thermal netlink multicast group name fix, otherwise
the assertion added in net-next to netlink to detect that kind of bug
makes systems unbootable for some folks.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch converts the Marvell MV643XX ethernet driver to use the
Marvell Orion MDIO driver. As a result, PowerPC and ARM platforms
registering the Marvell MV643XX ethernet driver are also updated to
register a Marvell Orion MDIO driver. This driver voluntarily overlaps
with the Marvell Ethernet shared registers because it will use a subset
of this shared register (shared_base + 0x4 to shared_base + 0x84). The
Ethernet driver is also updated to look up for a PHY device using the
Orion MDIO bus driver.
For ARM and PowerPC we register a single instance of the "mvmdio" driver
in the system like it used to be done with the use of the "shared_smi"
platform_data cookie on ARM.
Note that it is safe to register the mvmdio driver only for the "ge00"
instance of the driver because this "ge00" interface is guaranteed to
always be explicitely registered by consumers of
arch/arm/plat-orion/common.c and other instances (ge01, ge10 and ge11)
were all pointing their shared_smi to ge00. For PowerPC the in-tree
Device Tree Source files mention only one MV643XX ethernet MAC instance
so the MDIO bus driver is registered only when id == 0.
Signed-off-by: Florian Fainelli <florian@openwrt.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
You can access it directly now, since 3.8: v3.7-rc1-13-g06ca287
'virtio: move queue_index and num_free fields into core struct
virtqueue.'
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If bpf_jit_enable > 1, then we dump the emitted JIT compiled image
after creation. Currently, only SPARC and PowerPC has similar output
as in the reference implementation on x86_64. Make a small helper
function in order to reduce duplicated code and make the dump output
uniform across architectures x86_64, SPARC, PPC, ARM (e.g. on ARM
flen, pass and proglen are currently not shown, but would be
interesting to know as well), also for future BPF JIT implementations
on other archs.
Cc: Mircea Gherzan <mgherzan@gmail.com>
Cc: Matt Evans <matt@ozlabs.org>
Cc: Eric Dumazet <eric.dumazet@google.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements F-RTO (foward RTO recovery):
When the first retransmission after timeout is acknowledged, F-RTO
sends new data instead of old data. If the next ACK acknowledges
some never-retransmitted data, then the timeout was spurious and the
congestion state is reverted. Otherwise if the next ACK selectively
acknowledges the new data, then the timeout was genuine and the
loss recovery continues. This idea applies to recurring timeouts
as well. While F-RTO sends different data during timeout recovery,
it does not (and should not) change the congestion control.
The implementaion follows the three steps of SACK enhanced algorithm
(section 3) in RFC5682. Step 1 is in tcp_enter_loss(). Step 2 and
3 are in tcp_process_loss(). The basic version is not supported
because SACK enhanced version also works for non-SACK connections.
The new implementation is functionally in parity with the old F-RTO
implementation except the one case where it increases undo events:
In addition to the RFC algorithm, a spurious timeout may be detected
without sending data in step 2, as long as the SACK confirms not
all the original data are dropped. When this happens, the sender
will undo the cwnd and perhaps enter fast recovery instead. This
additional check increases the F-RTO undo events by 5x compared
to the prior implementation on Google Web servers, since the sender
often does not have new data to send for HTTP.
Note F-RTO may detect spurious timeout before Eifel with timestamps
does so.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch series refactor the F-RTO feature (RFC4138/5682).
This is to simplify the loss recovery processing. Existing F-RTO
was developed during the experimental stage (RFC4138) and has
many experimental features. It takes a separate code path from
the traditional timeout processing by overloading CA_Disorder
instead of using CA_Loss state. This complicates CA_Disorder state
handling because it's also used for handling dubious ACKs and undos.
While the algorithm in the RFC does not change the congestion control,
the implementation intercepts congestion control in various places
(e.g., frto_cwnd in tcp_ack()).
The new code implements newer F-RTO RFC5682 using CA_Loss processing
path. F-RTO becomes a small extension in the timeout processing
and interfaces with congestion control and Eifel undo modules.
It lets congestion control (module) determines how many to send
independently. F-RTO only chooses what to send in order to detect
spurious retranmission. If timeout is found spurious it invokes
existing Eifel undo algorithms like DSACK or TCP timestamp based
detection.
The first patch removes all F-RTO code except the sysctl_tcp_frto is
left for the new implementation. Since CA_EVENT_FRTO is removed, TCP
westwood now computes ssthresh on regular timeout CA_EVENT_LOSS event.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Process connector can now also detect coredumping events.
Main aim of patch is get notified at start of coredumping, instead of
having to wait for it to finish and then being notified through EXIT
event.
Could be used for instance by process-managers that want to get
notified as soon as possible about process failures, and not
necessarily beeing notified after coredump, which could be in the
order of minutes depending on size of coredump, piping and so on.
Signed-off-by: Jesper Derehag <jderehag@hotmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is very useful to do dynamic truncation of packets. In particular,
we're interested to push the necessary header bytes to the user space and
cut off user payload that should probably not be transferred for some reasons
(e.g. privacy, speed, or others). With the ancillary extension PAY_OFFSET,
we can load it into the accumulator, and return it. E.g. in bpfc syntax ...
ld #poff ; { 0x20, 0, 0, 0xfffff034 },
ret a ; { 0x16, 0, 0, 0x00000000 },
... as a filter will accomplish this without having to do a big hackery in
a BPF filter itself. Follow-up JIT implementations are welcome.
Thanks to Eric Dumazet for suggesting and discussing this during the
Netfilter Workshop in Copenhagen.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__skb_get_poff() returns the offset to the payload as far as it could
be dissected. The main user is currently BPF, so that we can dynamically
truncate packets without needing to push actual payload to the user
space and instead can analyze headers only.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Users of udp encapsulation currently have an encap_rcv callback which they can
use to hook into the udp receive path.
In situations where a encapsulation user allocates resources associated with a
udp encap socket, it may be convenient to be able to also hook the proto
.destroy operation. For example, if an encap user holds a reference to the
udp socket, the destroy hook might be used to relinquish this reference.
This patch adds a socket destroy hook into udp, which is set and enabled
in the same way as the existing encap_rcv hook.
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix ARM BPF JIT handling of negative 'k' values, from Chen Gang.
2) Insufficient space reserved for bridge netlink values, fix from
Stephen Hemminger.
3) Some dst_neigh_lookup*() callers don't interpret error pointer
correctly, fix from Zhouyi Zhou.
4) Fix transport match in SCTP active_path loops, from Xugeng Zhang.
5) Fix qeth driver handling of multi-order SKB frags, from Frank
Blaschka.
6) fec driver is missing napi_disable() call, resulting in crashes on
unload, from Georg Hofmann.
7) Don't try to handle PMTU events on a listening socket, fix from Eric
Dumazet.
8) Fix timestamp location calculations in IP option processing, from
David Ward.
9) FIB_TABLE_HASHSZ setting is not controlled by the correct kconfig
tests, from Denis V Lunev.
10) Fix TX descriptor push handling in SFC driver, from Ben Hutchings.
11) Fix isdn/hisax and tulip/de4x5 kconfig dependencies, from Arnd
Bergmann.
12) bnx2x statistics don't handle 4GB rollover correctly, fix from
Maciej Żenczykowski.
13) Openvswitch bug fixes for vport del/new error reporting, missing
genlmsg_end() call in netlink processing, and mis-parsing of
LLC/SNAP ethernet types. From Rich Lane.
14) SKB pfmemalloc state should only be propagated from the head page of
a compound page, fix from Pavel Emelyanov.
15) Fix link handling in tg3 driver for 5715 chips when autonegotation
is disabled. From Nithin Sujir.
16) Fix inverted test of cpdma_check_free_tx_desc return value in
davinci_emac driver, from Mugunthan V N.
17) vlan_depth is incorrectly calculated in skb_network_protocol(), from
Li RongQing.
18) Fix probing of Gobi 1K devices in qmi_wwan driver, and fix NCM
device mode backwards compat in cdc_ncm driver. From Bjørn Mork.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits)
inet: limit length of fragment queue hash table bucket lists
qeth: Fix scatter-gather regression
qeth: Fix invalid router settings handling
qeth: delay feature trace
tcp: dont handle MTU reduction on LISTEN socket
bnx2x: fix occasional statistics off-by-4GB error
vhost/net: fix heads usage of ubuf_info
bridge: Add support for setting BR_ROOT_BLOCK flag.
bnx2x: add missing napi deletion in error path
drivers: net: ethernet: ti: davinci_emac: fix usage of cpdma_check_free_tx_desc()
ethernet/tulip: DE4x5 needs VIRT_TO_BUS
isdn: hisax: netjet requires VIRT_TO_BUS
net: cdc_ncm, cdc_mbim: allow user to prefer NCM for backwards compatibility
rtnetlink: Mask the rta_type when range checking
Revert "ip_gre: make ipgre_tunnel_xmit() not parse network header as IP unconditionally"
Fix dst_neigh_lookup/dst_neigh_lookup_skb return value handling bug
smsc75xx: configuration help incorrectly mentions smsc95xx
net: fec: fix missing napi_disable call
net: fec: restart the FEC when PHY speed changes
skb: Propagate pfmemalloc on skb from head page only
...
This fixes a couple of problems. Firstly, some people are actually still
using old small-page flash and we broke it by removing the ready check.
Secondly. fix the handling of partitions on Broadcom 47xx devices.
Recent changes had made it misdetect the location of the NVRAM and
scribble over the bootloader when it tried to update the variables there.
With predictably sad results.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (GNU/Linux)
iEYEABECAAYFAlFG/koACgkQdwG7hYl686OShQCgyOXCXhTltBJU3wHDtTeT1rc9
Vx0AoL655JpZaUf+BpPsCSTagH4k28vQ
=OK6b
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20130318' of git://git.infradead.org/linux-mtd
Pull MTD fixes from David Woodhouse:
"This fixes a couple of problems. Firstly, some people are actually
still using old small-page flash and we broke it by removing the ready
check.
Secondly. fix the handling of partitions on Broadcom 47xx devices.
Recent changes had made it misdetect the location of the NVRAM and
scribble over the bootloader when it tried to update the variables
there. With predictably sad results."
* tag 'for-linus-20130318' of git://git.infradead.org/linux-mtd:
mtd: nand: reintroduce NAND_NO_READRDY as NAND_NEED_READRDY
mtd: bcm47xxpart: look for NVRAM at the end of device
Revert "mtd: bcm47xxpart: improve probing of nvram partition"
Commit 1d9d8639c0 ("perf,x86: fix kernel crash with PEBS/BTS after
suspend/resume") introduces a link failure since
perf_restore_debug_store() is only defined for CONFIG_CPU_SUP_INTEL:
arch/x86/power/built-in.o: In function `restore_processor_state':
(.text+0x45c): undefined reference to `perf_restore_debug_store'
Fix it by defining the dummy function appropriately.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
TCPCT uses option-number 253, reserved for experimental use and should
not be used in production environments.
Further, TCPCT does not fully implement RFC 6013.
As a nice side-effect, removing TCPCT increases TCP's performance for
very short flows:
Doing an apache-benchmark with -c 100 -n 100000, sending HTTP-requests
for files of 1KB size.
before this patch:
average (among 7 runs) of 20845.5 Requests/Second
after:
average (among 7 runs) of 21403.6 Requests/Second
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Pravin B Shelar <pshelar@nicira.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/openvswitch/vport-internal_dev.c
Jesse Gross says:
====================
A couple of minor enhancements for net-next/3.10. The largest is an
extension to allow variable length metadata to be passed to userspace
with packets.
There is a merge conflict in net/openvswitch/vport-internal_dev.c:
A existing commit modifies internal_dev_mac_addr() and a new commit
deletes it. The new one is correct, so you can just remove that function.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
commit bd329e1 ("net: cdc_ncm: do not bind to NCM compatible MBIM devices")
introduced a new policy, preferring MBIM for dual NCM/MBIM functions if
the cdc_mbim driver was enabled. This caused a regression for users
wanting to use NCM.
Devices implementing NCM backwards compatibility according to section
3.2 of the MBIM v1.0 specification allow either NCM or MBIM on a single
USB function, using different altsettings. The cdc_ncm and cdc_mbim
drivers will both probe such functions, and must agree on a common
policy for selecting either MBIM or NCM. Until now, this policy has
been set at build time based on CONFIG_USB_NET_CDC_MBIM.
Use a module parameter to set the system policy at runtime, allowing the
user to prefer NCM on systems with the cdc_mbim driver.
Cc: Greg Suarez <gsuarez@smithmicro.com>
Cc: Alexey Orishko <alexey.orishko@stericsson.com>
Reported-by: Geir Haatveit <nospam@haatveit.nu>
Reported-by: Tommi Kyntola <kynde@ts.ray.fi>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=54791
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
With this one we have:
- An ab8500 build failure fix.
- An ab8500 device tree parsing fix.
- A fix for twl4030_madc remove routine to work properly (when built-in).
- A fix for properly registering palmas interrupt handler.
- A fix for omap-usb init routine to actually write into the hostconfig
register.
- A couple of warning fixes for ab8500-gpadc and tps65912.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJRQs2IAAoJEIqAPN1PVmxKN6sQAISiEmSjOiLRZ3b13qeKGiMX
3F3W5qXGYJQzhiVU/unPK9/2rTdfJ0Mqnuj88bLY+336pr89vi8g0k2VBWPfHlcv
jv2mJKWuNUM15D0d6uZ3t27jrXXBzNEvMjKKen2kYfm1+JcR+1N8pbJUZtOlgREJ
2Z23or/XP+ZCSbXcvzH1KvRFrcaBDedqgT4m0nat2dTfZ77MpKZEA58sutNBOUMa
YfoE2ncJBq5Ku4LBo6wGhsVptzilmpH3YBnSEgTh7tccyLrt4SnodNdT3vxmU4+1
C0r/oX9Idcij4/VAW/2bcEq1E12YUeKnPeVDAARXc+X/1Oyw+Mc6Cqd/rnvFomfe
9uTGMOmH1me1mSUgrzFNM1nnKXaZkWti56rLEe8tLsN7tkc+oWSJPIlWwjCmeWkC
R5SPgR5Jf6QeA4mW9qMxnBM3y8YlaV4awW6wwVYXGUDICZMG04qURSLo9NX3kDAA
AHsbE+IiSxdDqmZr8aejhaGd7PIVjfPhnM25TNlUZ5P5EWlApl4ha7QZ/BK2fx6k
DqaRujfB2uGcQAwqTprVpDI6lHUEneUKaa3zDpoxkP4vPVXDsRIiBo7HRfY8QXjg
vMWrYfLvAYBXHx69I5q2DghyFmofu/H0SXDdGlo33cQ5SfzeNGS7h0pPGrkNJyvN
7WJbiOdAYeQXVlBCy7sB
=XFgy
-----END PGP SIGNATURE-----
Merge tag 'mfd-fixes-3.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-fixes
Pull MFD fixes from Samuel Ortiz:
"This is the first batch of MFD fixes for 3.9.
With this one we have:
- An ab8500 build failure fix.
- An ab8500 device tree parsing fix.
- A fix for twl4030_madc remove routine to work properly (when
built-in).
- A fix for properly registering palmas interrupt handler.
- A fix for omap-usb init routine to actually write into the
hostconfig register.
- A couple of warning fixes for ab8500-gpadc and tps65912"
* tag 'mfd-fixes-3.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-fixes:
mfd: twl4030-madc: Remove __exit_p annotation
mfd: ab8500: Kill "reg" property from binding
mfd: ab8500-gpadc: Complain if we fail to enable vtvout LDO
mfd: wm831x: Don't forward declare enum wm831x_auxadc
mfd: twl4030-audio: Fix argument type for twl4030_audio_disable_resource()
mfd: tps65912: Declare and use tps65912_irq_exit()
mfd: palmas: Provide irq flags through DT/platform data
mfd: Make AB8500_CORE select POWER_SUPPLY to fix build error
mfd: omap-usb-host: Actually update hostconfig
This patch fixes a kernel crash when using precise sampling (PEBS)
after a suspend/resume. Turns out the CPU notifier code is not invoked
on CPU0 (BP). Therefore, the DS_AREA (used by PEBS) is not restored properly
by the kernel and keeps it power-on/resume value of 0 causing any PEBS
measurement to crash when running on CPU0.
The workaround is to add a hook in the actual resume code to restore
the DS Area MSR value. It is invoked for all CPUS. So for all but CPU0,
the DS_AREA will be restored twice but this is harmless.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
COMPAT_NET_DEV_OPS was removed a while back and with it the definition of
netdev_resync_ops() went away. Let's finish the clean-up.
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows LRO aggregation on bonded devices that contain an
NX3031 device. It also adds a for_each_netdev_in_bond_rcu(bond, slave)
macro which executes for each slave that has bond as master.
V3: After testing and discussing this with Rajesh, I decided to keep the
vlan ip cache and just rename it to ip_cache since it will store bond
ip addresses too. A new master flag has been added to the ip cache to
denote that the address has been added because of a master device.
I've taken care of the enslave/release cases by checking for various
combinations of events and flags (e.g. netxen has a master, it's a
bond master and it's not marked as a slave means it is being enslaved
and is dev_open()ed in bond_enslave).
I've changed netxen_free_ip_list() to have a "master" parameter which
causes all IP addresses marked as master to be deleted (used when a
netxen is being released). I've made the patch use the new upper
device API as well. The following cases were tested:
- bond -> netxen
- vlan -> netxen
- vlan -> bond -> netxen
V2: Remove local ip caching, retrieve addresses dynamically and
restore them if necessary.
Note: Tested with NX3031 adapter.
Tested-by: Rajesh Borundia <rajesh.borundia@qlogic.com>
Signed-off-by: Andy Gospodarek <agospoda@redhat.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull fix for hlist_entry_safe() regression from Paul McKenney:
"This contains a single commit that fixes a regression in
hlist_entry_safe(). This macro references its argument twice, which
can cause NULL-pointer errors. This commit applies a gcc statement
expression, creating a temporary variable to avoid the double
reference. This has been posted to LKML at
https://lkml.org/lkml/2013/3/9/75.
Kudos to CAI Qian, whose testing uncovered this, to Eric Dumazet, who
spotted root cause, and to Li Zefan, who tested this commit."
* 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
list: Fix double fetch of pointer in hlist_entry_safe()
The current version of hlist_entry_safe() fetches the pointer twice,
once to test for NULL and the other to compute the offset back to the
enclosing structure. This is OK for normal lock-based use because in
that case, the pointer cannot change. However, when the pointer is
protected by RCU (as in "rcu_dereference(p)"), then the pointer can
change at any time. This use case can result in the following sequence
of events:
1. CPU 0 invokes hlist_entry_safe(), fetches the RCU-protected
pointer as sees that it is non-NULL.
2. CPU 1 invokes hlist_del_rcu(), deleting the entry that CPU 0
just fetched a pointer to. Because this is the last entry
in the list, the pointer fetched by CPU 0 is now NULL.
3. CPU 0 refetches the pointer, obtains NULL, and then gets a
NULL-pointer crash.
This commit therefore applies gcc's "({ })" statement expression to
create a temporary variable so that the specified pointer is fetched
only once, avoiding the above sequence of events. Please note that
it is the caller's responsibility to use rcu_dereference() as needed.
This allows RCU-protected uses to work correctly without imposing
any additional overhead on the non-RCU case.
Many thanks to Eric Dumazet for spotting root cause!
Reported-by: CAI Qian <caiqian@redhat.com>
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Li Zefan <lizefan@huawei.com>
Hi.
I'm trying to send big chunks of memory from application address space via
TCP socket using vmsplice + splice like this
mem = mmap(128Mb);
vmsplice(pipe[1], mem); /* splice memory into pipe */
splice(pipe[0], tcp_socket); /* send it into network */
When I'm lucky and a huge page splices into the pipe and then into the socket
_and_ client and server ends of the TCP connection are on the same host,
communicating via lo, the whole connection gets stuck! The sending queue
becomes full and app stops writing/splicing more into it, but the receiving
queue remains empty, and that's why.
The __skb_fill_page_desc observes a tail page of a huge page and erroneously
propagates its page->pfmemalloc value onto socket (the pfmemalloc on tail pages
contain garbage). Then this skb->pfmemalloc leaks through lo and due to the
tcp_v4_rcv
sk_filter
if (skb->pfmemalloc && !sock_flag(sk, SOCK_MEMALLOC)) /* true */
return -ENOMEM
goto release_and_discard;
no packets reach the socket. Even TCP re-transmits are dropped by this, as skb
cloning clones the pfmemalloc flag as well.
That said, here's the proper page->pfmemalloc propagation onto socket: we
must check the huge-page's head page only, other pages' pfmemalloc and mapping
values do not contain what is expected in this place. However, I'm not sure
whether this fix is _complete_, since pfmemalloc propagation via lo also
oesn't look great.
Both, bit propagation from page to skb and this check in sk_filter, were
introduced by c48a11c7 (netvm: propagate page->pfmemalloc to skb), in v3.5 so
Mel and stable@ are in Cc.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Chrome OS team reported a crash on a Pixel ChromeBook in TCP stack :
https://code.google.com/p/chromium/issues/detail?id=182056
commit a21d45726a (tcp: avoid order-1 allocations on wifi and tx
path) did a poor choice adding an 'avail_size' field to skb, while
what we really needed was a 'reserved_tailroom' one.
It would have avoided commit 22b4a4f22d (tcp: fix retransmit of
partially acked frames) and this commit.
Crash occurs because skb_split() is not aware of the 'avail_size'
management (and should not be aware)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Mukesh Agrawal <quiche@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This partially reverts commit 1696e6bc2a
("mtd: nand: kill NAND_NO_READRDY").
In that patch I overlooked a few things.
The original documentation for NAND_NO_READRDY included "True for all
large page devices, as they do not support autoincrement." I was
conflating "not support autoincrement" with the NAND_NO_AUTOINCR option,
which was in fact doing nothing. So, when I dropped NAND_NO_AUTOINCR, I
concluded that I then could harmlessly drop NAND_NO_READRDY. But of
course the fact the NAND_NO_AUTOINCR was doing nothing didn't mean
NAND_NO_READRDY was doing nothing...
So, NAND_NO_READRDY is re-introduced as NAND_NEED_READRDY and applied
only to those few remaining small-page NAND which needed it in the first
place.
Cc: stable@kernel.org [3.5+]
Reported-by: Alexander Shiyan <shc_work@mail.ru>
Tested-by: Alexander Shiyan <shc_work@mail.ru>
Signed-off-by: Brian Norris <computersforpeace@gmail.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Merge misc fixes from Andrew Morton:
- A bunch of fixes
- Finish off the idr API conversions before someone starts to use the
old interfaces again.
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
idr: idr_alloc() shouldn't trigger lowmem warning when preloaded
UAPI: fix endianness conditionals in M32R's asm/stat.h
UAPI: fix endianness conditionals in linux/raid/md_p.h
UAPI: fix endianness conditionals in linux/acct.h
UAPI: fix endianness conditionals in linux/aio_abi.h
decompressors: fix typo "POWERPC"
mm/fremap.c: fix oops on error path
idr: deprecate idr_pre_get() and idr_get_new[_above]()
tidspbridge: convert to idr_alloc()
zcache: convert to idr_alloc()
mlx4: remove leftover idr_pre_get() call
workqueue: convert to idr_alloc()
nfsd: convert to idr_alloc()
nfsd: remove unused get_new_stid()
kernel/signal.c: use __ARCH_HAS_SA_RESTORER instead of SA_RESTORER
signal: always clear sa_restorer on execve
mm: remove_memory(): fix end_pfn setting
include/linux/res_counter.h needs errno.h
Now that all in-kernel users are converted to ues the new alloc
interface, mark the old interface deprecated. We should be able to
remove these in a few releases.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
alpha allmodconfig:
In file included from mm/memcontrol.c:28:
include/linux/res_counter.h: In function 'res_counter_set_limit':
include/linux/res_counter.h:203: error: 'EBUSY' undeclared (first use in this function)
include/linux/res_counter.h:203: error: (Each undeclared identifier is reported only once
include/linux/res_counter.h:203: error: for each function it appears in.)
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Here are a number of tiny USB fixes and new USB device ids for your 3.9
tree.
The "largest" one here is a revert of a usb-storage patch that turned
out to be incorrect, breaking existing users, which is never a good
thing. Everything else is pretty simple and small
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iEYEABECAAYFAlFA3rgACgkQMUfUDdst+ylSmQCfdQGXwi/1JX0099FKsnt4dcXY
SbMAn1GWSwYPo1Uk5joKJpNh412PMnXZ
=kNF7
-----END PGP SIGNATURE-----
Merge tag 'usb-3.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg Kroah-Hartman:
"Here are a number of tiny USB fixes and new USB device ids for your
3.9 tree.
The "largest" one here is a revert of a usb-storage patch that turned
out to be incorrect, breaking existing users, which is never a good
thing. Everything else is pretty simple and small"
* tag 'usb-3.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (43 commits)
USB: quatech2: only write to the tty if the port is open.
qcserial: bind to DM/DIAG port on Gobi 1K devices
USB: cdc-wdm: fix buffer overflow
usb: serial: Add Rigblaster Advantage to device table
qcaux: add Franklin U600
usb: musb: core: fix possible build error with randconfig
usb: cp210x new Vendor/Device IDs
usb: gadget: pxa25x: fix disconnect reporting
usb: dwc3: ep0: fix sparc64 build
usb: c67x00 RetryCnt value in c67x00 TD should be 3
usb: Correction to c67x00 TD data length mask
usb: Makefile: fix drivers/usb/phy/ Makefile entry
USB: added support for Cinterion's products AH6 and PLS8
usb: gadget: fix omap_udc build errors
USB: storage: fix Huawei mode switching regression
USB: storage: in-kernel modeswitching is deprecated
tools: usb: ffs-test: Fix build failure
USB: option: add Huawei E5331
usb: musb: omap2430: fix sparse warning
usb: musb: omap2430: fix omap_musb_mailbox glue check again
...
Here are some drivers/staging and drivers/iio fixes for 3.9 (the two are
still pretty intertwined, hence them coming both from my tree still.)
Nothing major, just a few things that have been reported by users, all
of these have been in linux-next for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iEYEABECAAYFAlFA3+8ACgkQMUfUDdst+ymO3gCdHvj7ucFy04bdR38hSgnKxcWt
+70AoMX/kr0jSJgi6IKbBpH3o7LlSouz
=o79p
-----END PGP SIGNATURE-----
Merge tag 'staging-3.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging tree fixes from Greg Kroah-Hartman:
"Here are some drivers/staging and drivers/iio fixes for 3.9 (the two
are still pretty intertwined, hence them coming both from my tree
still.) Nothing major, just a few things that have been reported by
users, all of these have been in linux-next for a while."
* tag 'staging-3.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
staging: comedi: dt9812: use CR_CHAN() for channel number
staging/vt6656: Fix too large integer constant warning on 32-bit
staging: comedi: drivers: usbduxsigma.c: fix DMA buffers on stack
staging: imx/drm: request irq only after adding the crtc
staging: comedi: drivers: usbduxfast.c: fix for DMA buffers on stack
staging: comedi: drivers: usbdux.c: fix DMA buffers on stack
staging: vt6656: Fix oops on resume from suspend.
iio:common:st_sensors fixed all warning messages about uninitialized variables
iio: Fix build error seen if IIO_TRIGGER is defined but IIO_BUFFER is not
iio/imu: inv_mpu6050 depends on IIO_BUFFER
iio:ad5064: Initialize register cache correctly
iio:ad5064: Fix off by one in DAC value range check
iio:ad5064: Fix address of the second channel for ad5065/ad5045/ad5025
Change cpts_active_slave to active_slave so that the same DT property
can be used to ethtool and SIOCGMIIPHY.
CC: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix new kernel-doc warnings in idr:
Warning(include/linux/idr.h:113): No description found for parameter 'idr'
Warning(include/linux/idr.h:113): Excess function parameter 'idp' description in 'idr_find'
Warning(lib/idr.c:232): Excess function parameter 'id' description in 'sub_alloc'
Warning(lib/idr.c:232): Excess function parameter 'id' description in 'sub_alloc'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This allows ethernet drivers (such as the mv643xx_eth) to support
Wake on LAN on platforms where PHY registers have to be configured
for Wake on LAN (e.g. the Marvell Kirkwood based qnap TS-119P II).
Signed-off-by: Michael Stapelberg <michael@stapelberg.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the second of the TLP patch series; it augments the basic TLP
algorithm with a loss detection scheme.
This patch implements a mechanism for loss detection when a Tail
loss probe retransmission plugs a hole thereby masking packet loss
from the sender. The loss detection algorithm relies on counting
TLP dupacks as outlined in Sec. 3 of:
http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01
The basic idea is: Sender keeps track of TLP "episode" upon
retransmission of a TLP packet. An episode ends when the sender receives
an ACK above the SND.NXT (tracked by tlp_high_seq) at the time of the
episode. We want to make sure that before the episode ends the sender
receives a "TLP dupack", indicating that the TLP retransmission was
unnecessary, so there was no loss/hole that needed plugging. If the
sender gets no TLP dupack before the end of the episode, then it reduces
ssthresh and the congestion window, because the TLP packet arriving at
the receiver probably plugged a hole.
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch series implement the Tail loss probe (TLP) algorithm described
in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The
first patch implements the basic algorithm.
TLP's goal is to reduce tail latency of short transactions. It achieves
this by converting retransmission timeouts (RTOs) occuring due
to tail losses (losses at end of transactions) into fast recovery.
TLP transmits one packet in two round-trips when a connection is in
Open state and isn't receiving any ACKs. The transmitted packet, aka
loss probe, can be either new or a retransmission. When there is tail
loss, the ACK from a loss probe triggers FACK/early-retransmit based
fast recovery, thus avoiding a costly RTO. In the absence of loss,
there is no change in the connection state.
PTO stands for probe timeout. It is a timer event indicating
that an ACK is overdue and triggers a loss probe packet. The PTO value
is set to max(2*SRTT, 10ms) and is adjusted to account for delayed
ACK timer when there is only one oustanding packet.
TLP Algorithm
On transmission of new data in Open state:
-> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
-> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
-> PTO = min(PTO, RTO)
Conditions for scheduling PTO:
-> Connection is in Open state.
-> Connection is either cwnd limited or no new data to send.
-> Number of probes per tail loss episode is limited to one.
-> Connection is SACK enabled.
When PTO fires:
new_segment_exists:
-> transmit new segment.
-> packets_out++. cwnd remains same.
no_new_packet:
-> retransmit the last segment.
Its ACK triggers FACK or early retransmit based recovery.
ACK path:
-> rearm RTO at start of ACK processing.
-> reschedule PTO if need be.
In addition, the patch includes a small variation to the Early Retransmit
(ER) algorithm, such that ER and TLP together can in principle recover any
N-degree of tail loss through fast recovery. TLP is controlled by the same
sysctl as ER, tcp_early_retrans sysctl.
tcp_early_retrans==0; disables TLP and ER.
==1; enables RFC5827 ER.
==2; delayed ER.
==3; TLP and delayed ER. [DEFAULT]
==4; TLP only.
The TLP patch series have been extensively tested on Google Web servers.
It is most effective for short Web trasactions, where it reduced RTOs by 15%
and improved HTTP response time (average by 6%, 99th percentile by 10%).
The transmitted probes account for <0.5% of the overall transmissions.
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Micrel PHY KSZ8031 is similar to KSZ8021 and also requires the special
initialization of "Operation Mode Strap Override" in reg 0x16
introduced in 212ea99 (phy/micrel: Implement support for KSZ8021).
Signed-off-by: Hector Palacios <hector.palacios@digi.com>
Reviewed-by: Marek Vasut <marex@denx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/intel/e1000e/netdev.c
Minor conflict in e1000e, a line that got fixed in 'net'
has been removed in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Clean up interrupts on exit, silencing a sparse warning caused by
tps65912_irq_exit() being defined but not prototyped as we go.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Currently driver sets the irq type to IRQF_TRIGGER_LOW which is
causing interrupt registration failure in ARM based SoCs as:
[ 0.208479] genirq: Setting trigger mode 8 for irq 118 failed (gic_set_type+0x0/0xf0)
[ 0.208513] dummy 0-0059: Failed to request IRQ 118: -22
Provide the irq flags through platform data if device is registered
through board file or get the irq type from DT node property in place
of hardcoding the irq flag in driver to support multiple platforms.
Also configure the device to generate the interrupt signal according to
flag type.
Signed-off-by: Laxman Dewangan <ldewangan@nvidia.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Some LLCP services (e.g. the validation ones) require some control over
the LLCP link parameters like the receive window (RW) or the MIU extension
(MIUX). This can only be done through socket options.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
tw_cookie_values is never used in the TCP-stack.
It was added by 435cf559f (TCPCT part 1d: define TCP cookie option,
extend existing struct's), but already at that time it was not used at
all, nor mentioned in the commit-message.
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull namespace bugfixes from Eric Biederman:
"This is three simple fixes against 3.9-rc1. I have tested each of
these fixes and verified they work correctly.
The userns oops in key_change_session_keyring and the BUG_ON triggered
by proc_ns_follow_link were found by Dave Jones.
I am including the enhancement for mount to only trigger requests of
filesystem modules here instead of delaying this for the 3.10 merge
window because it is both trivial and the kind of change that tends to
bit-rot if left untouched for two months."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
proc: Use nd_jump_link in proc_ns_follow_link
fs: Limit sys_mount to only request filesystem modules (Part 2).
fs: Limit sys_mount to only request filesystem modules.
userns: Stop oopsing in key_change_session_keyring
Adds generic tunneling offloading support for IPv4-UDP based
tunnels.
GSO type is added to request this offload for a skb.
netdev feature NETIF_F_UDP_TUNNEL is added for hardware offloaded
udp-tunnel support. Currently no device supports this feature,
software offload is used.
This can be used by tunneling protocols like VXLAN.
CC: Jesse Gross <jesse@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>