Commit Graph

2739 Commits

Author SHA1 Message Date
Amerigo Wang
68bf9f0b91 pktgen: set different default min_pkt_size for different protocols
ETH_ZLEN is too small for IPv6, so this default value is not
suitable.

Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-10 22:33:30 -04:00
Amerigo Wang
5aa8b57200 pktgen: fix crash when generating IPv6 packets
For IPv6, sizeof(struct ipv6hdr) = 40, thus the following
expression will result negative:

        datalen = pkt_dev->cur_pkt_size - 14 -
                  sizeof(struct ipv6hdr) - sizeof(struct udphdr) -
                  pkt_dev->pkt_overhead;

And,  the check "if (datalen < sizeof(struct pktgen_hdr))" will be
passed as "datalen" is promoted to unsigned, therefore will cause
a crash later.

This is a quick fix by checking if "datalen" is negative. The following
patch will increase the default value of 'min_pkt_size' for IPv6.

This bug should exist for a long time, so Cc -stable too.

Cc: <stable@vger.kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-10 22:33:30 -04:00
Florian Zumbiehl
48cc32d38a vlan: don't deliver frames for unknown vlans to protocols
6a32e4f9dd made the vlan code skip marking
vlan-tagged frames for not locally configured vlans as PACKET_OTHERHOST if
there was an rx_handler, as the rx_handler could cause the frame to be received
on a different (virtual) vlan-capable interface where that vlan might be
configured.

As rx_handlers do not necessarily return RX_HANDLER_ANOTHER, this could cause
frames for unknown vlans to be delivered to the protocol stack as if they had
been received untagged.

For example, if an ipv6 router advertisement that's tagged for a locally not
configured vlan is received on an interface with macvlan interfaces attached,
macvlan's rx_handler returns RX_HANDLER_PASS after delivering the frame to the
macvlan interfaces, which caused it to be passed to the protocol stack, leading
to ipv6 addresses for the announced prefix being configured even though those
are completely unusable on the underlying interface.

The fix moves marking as PACKET_OTHERHOST after the rx_handler so the
rx_handler, if there is one, sees the frame unchanged, but afterwards,
before the frame is delivered to the protocol stack, it gets marked whether
there is an rx_handler or not.

Signed-off-by: Florian Zumbiehl <florz@florz.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-08 15:21:55 -04:00
Eric Dumazet
2e71a6f808 net: gro: selective flush of packets
Current GRO can hold packets in gro_list for almost unlimited
time, in case napi->poll() handler consumes its budget over and over.

In this case, napi_complete()/napi_gro_flush() are not called.

Another problem is that gro_list is flushed in non friendly way :
We scan the list and complete packets in the reverse order.
(youngest packets first, oldest packets last)
This defeats priorities that sender could have cooked.

Since GRO currently only store TCP packets, we dont really notice the
bug because of retransmits, but this behavior can add unexpected
latencies, particularly on mice flows clamped by elephant flows.

This patch makes sure no packet can stay more than 1 ms in queue, and
only in stress situations.

It also complete packets in the right order to minimize latencies.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jesse Gross <jesse@nicira.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-08 14:51:51 -04:00
Eric Dumazet
ca07e43e28 net: gro: fix a potential crash in skb_gro_reset_offset
Before accessing skb first fragment, better make sure there
is one.

This is probably not needed for old kernels, since an ethernet frame
cannot contain only an ethernet header, but the recent GRO addition
to tunnels makes this patch needed.

Also skb_gro_reset_offset() can be static, it actually allows
compiler to inline it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-07 14:49:17 -04:00
ramesh.nagappa@gmail.com
e1f165032c net: Fix skb_under_panic oops in neigh_resolve_output
The retry loop in neigh_resolve_output() and neigh_connected_output()
call dev_hard_header() with out reseting the skb to network_header.
This causes the retry to fail with skb_under_panic. The fix is to
reset the network_header within the retry loop.

Signed-off-by: Ramesh Nagappa <ramesh.nagappa@ericsson.com>
Reviewed-by: Shawn Lu <shawn.lu@ericsson.com>
Reviewed-by: Robert Coulson <robert.coulson@ericsson.com>
Reviewed-by: Billie Alsup <billie.alsup@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-07 14:42:39 -04:00
Eric Dumazet
acb600def2 net: remove skb recycling
Over time, skb recycling infrastructure got litle interest and
many bugs. Generic rx path skb allocation is now using page
fragments for efficient GRO / TCP coalescing, and recyling
a tx skb for rx path is not worth the pain.

Last identified bug is that fat skbs can be recycled
and it can endup using high order pages after few iterations.

With help from Maxime Bizon, who pointed out that commit
87151b8689 (net: allow pskb_expand_head() to get maximum tailroom)
introduced this regression for recycled skbs.

Instead of fixing this bug, lets remove skb recycling.

Drivers wanting really hot skbs should use build_skb() anyway,
to allocate/populate sk_buff right before netif_receive_skb()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Maxime Bizon <mbizon@freebox.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-07 00:40:54 -04:00
Linus Torvalds
aab174f0df Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs update from Al Viro:

 - big one - consolidation of descriptor-related logics; almost all of
   that is moved to fs/file.c

   (BTW, I'm seriously tempted to rename the result to fd.c.  As it is,
   we have a situation when file_table.c is about handling of struct
   file and file.c is about handling of descriptor tables; the reasons
   are historical - file_table.c used to be about a static array of
   struct file we used to have way back).

   A lot of stray ends got cleaned up and converted to saner primitives,
   disgusting mess in android/binder.c is still disgusting, but at least
   doesn't poke so much in descriptor table guts anymore.  A bunch of
   relatively minor races got fixed in process, plus an ext4 struct file
   leak.

 - related thing - fget_light() partially unuglified; see fdget() in
   there (and yes, it generates the code as good as we used to have).

 - also related - bits of Cyrill's procfs stuff that got entangled into
   that work; _not_ all of it, just the initial move to fs/proc/fd.c and
   switch of fdinfo to seq_file.

 - Alex's fs/coredump.c spiltoff - the same story, had been easier to
   take that commit than mess with conflicts.  The rest is a separate
   pile, this was just a mechanical code movement.

 - a few misc patches all over the place.  Not all for this cycle,
   there'll be more (and quite a few currently sit in akpm's tree)."

Fix up trivial conflicts in the android binder driver, and some fairly
simple conflicts due to two different changes to the sock_alloc_file()
interface ("take descriptor handling from sock_alloc_file() to callers"
vs "net: Providing protocol type via system.sockprotoname xattr of
/proc/PID/fd entries" adding a dentry name to the socket)

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
  MAX_LFS_FILESIZE should be a loff_t
  compat: fs: Generic compat_sys_sendfile implementation
  fs: push rcu_barrier() from deactivate_locked_super() to filesystems
  btrfs: reada_extent doesn't need kref for refcount
  coredump: move core dump functionality into its own file
  coredump: prevent double-free on an error path in core dumper
  usb/gadget: fix misannotations
  fcntl: fix misannotations
  ceph: don't abuse d_delete() on failure exits
  hypfs: ->d_parent is never NULL or negative
  vfs: delete surplus inode NULL check
  switch simple cases of fget_light to fdget
  new helpers: fdget()/fdput()
  switch o2hb_region_dev_write() to fget_light()
  proc_map_files_readdir(): don't bother with grabbing files
  make get_file() return its argument
  vhost_set_vring(): turn pollstart/pollstop into bool
  switch prctl_set_mm_exe_file() to fget_light()
  switch xfs_find_handle() to fget_light()
  switch xfs_swapext() to fget_light()
  ...
2012-10-02 20:25:04 -07:00
Linus Torvalds
aecdc33e11 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking changes from David Miller:

 1) GRE now works over ipv6, from Dmitry Kozlov.

 2) Make SCTP more network namespace aware, from Eric Biederman.

 3) TEAM driver now works with non-ethernet devices, from Jiri Pirko.

 4) Make openvswitch network namespace aware, from Pravin B Shelar.

 5) IPV6 NAT implementation, from Patrick McHardy.

 6) Server side support for TCP Fast Open, from Jerry Chu and others.

 7) Packet BPF filter supports MOD and XOR, from Eric Dumazet and Daniel
    Borkmann.

 8) Increate the loopback default MTU to 64K, from Eric Dumazet.

 9) Use a per-task rather than per-socket page fragment allocator for
    outgoing networking traffic.  This benefits processes that have very
    many mostly idle sockets, which is quite common.

    From Eric Dumazet.

10) Use up to 32K for page fragment allocations, with fallbacks to
    smaller sizes when higher order page allocations fail.  Benefits are
    a) less segments for driver to process b) less calls to page
    allocator c) less waste of space.

    From Eric Dumazet.

11) Allow GRO to be used on GRE tunnels, from Eric Dumazet.

12) VXLAN device driver, one way to handle VLAN issues such as the
    limitation of 4096 VLAN IDs yet still have some level of isolation.
    From Stephen Hemminger.

13) As usual there is a large boatload of driver changes, with the scale
    perhaps tilted towards the wireless side this time around.

Fix up various fairly trivial conflicts, mostly caused by the user
namespace changes.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1012 commits)
  hyperv: Add buffer for extended info after the RNDIS response message.
  hyperv: Report actual status in receive completion packet
  hyperv: Remove extra allocated space for recv_pkt_list elements
  hyperv: Fix page buffer handling in rndis_filter_send_request()
  hyperv: Fix the missing return value in rndis_filter_set_packet_filter()
  hyperv: Fix the max_xfer_size in RNDIS initialization
  vxlan: put UDP socket in correct namespace
  vxlan: Depend on CONFIG_INET
  sfc: Fix the reported priorities of different filter types
  sfc: Remove EFX_FILTER_FLAG_RX_OVERRIDE_IP
  sfc: Fix loopback self-test with separate_tx_channels=1
  sfc: Fix MCDI structure field lookup
  sfc: Add parentheses around use of bitfield macro arguments
  sfc: Fix null function pointer in efx_sriov_channel_type
  vxlan: virtual extensible lan
  igmp: export symbol ip_mc_leave_group
  netlink: add attributes to fdb interface
  tg3: unconditionally select HWMON support when tg3 is enabled.
  Revert "net: ti cpsw ethernet: allow reading phy interface mode from DT"
  gre: fix sparse warning
  ...
2012-10-02 13:38:27 -07:00
Linus Torvalds
437589a74b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull user namespace changes from Eric Biederman:
 "This is a mostly modest set of changes to enable basic user namespace
  support.  This allows the code to code to compile with user namespaces
  enabled and removes the assumption there is only the initial user
  namespace.  Everything is converted except for the most complex of the
  filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
  nfs, ocfs2 and xfs as those patches need a bit more review.

  The strategy is to push kuid_t and kgid_t values are far down into
  subsystems and filesystems as reasonable.  Leaving the make_kuid and
  from_kuid operations to happen at the edge of userspace, as the values
  come off the disk, and as the values come in from the network.
  Letting compile type incompatible compile errors (present when user
  namespaces are enabled) guide me to find the issues.

  The most tricky areas have been the places where we had an implicit
  union of uid and gid values and were storing them in an unsigned int.
  Those places were converted into explicit unions.  I made certain to
  handle those places with simple trivial patches.

  Out of that work I discovered we have generic interfaces for storing
  quota by projid.  I had never heard of the project identifiers before.
  Adding full user namespace support for project identifiers accounts
  for most of the code size growth in my git tree.

  Ultimately there will be work to relax privlige checks from
  "capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
  root in a user names to do those things that today we only forbid to
  non-root users because it will confuse suid root applications.

  While I was pushing kuid_t and kgid_t changes deep into the audit code
  I made a few other cleanups.  I capitalized on the fact we process
  netlink messages in the context of the message sender.  I removed
  usage of NETLINK_CRED, and started directly using current->tty.

  Some of these patches have also made it into maintainer trees, with no
  problems from identical code from different trees showing up in
  linux-next.

  After reading through all of this code I feel like I might be able to
  win a game of kernel trivial pursuit."

Fix up some fairly trivial conflicts in netfilter uid/git logging code.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
  userns: Convert the ufs filesystem to use kuid/kgid where appropriate
  userns: Convert the udf filesystem to use kuid/kgid where appropriate
  userns: Convert ubifs to use kuid/kgid
  userns: Convert squashfs to use kuid/kgid where appropriate
  userns: Convert reiserfs to use kuid and kgid where appropriate
  userns: Convert jfs to use kuid/kgid where appropriate
  userns: Convert jffs2 to use kuid and kgid where appropriate
  userns: Convert hpfs to use kuid and kgid where appropriate
  userns: Convert btrfs to use kuid/kgid where appropriate
  userns: Convert bfs to use kuid/kgid where appropriate
  userns: Convert affs to use kuid/kgid wherwe appropriate
  userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
  userns: On ia64 deal with current_uid and current_gid being kuid and kgid
  userns: On ppc convert current_uid from a kuid before printing.
  userns: Convert s390 getting uid and gid system calls to use kuid and kgid
  userns: Convert s390 hypfs to use kuid and kgid where appropriate
  userns: Convert binder ipc to use kuids
  userns: Teach security_path_chown to take kuids and kgids
  userns: Add user namespace support to IMA
  userns: Convert EVM to deal with kuids and kgids in it's hmac computation
  ...
2012-10-02 11:11:09 -07:00
Linus Torvalds
68d47a137c Merge branch 'for-3.7-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup hierarchy update from Tejun Heo:
 "Currently, different cgroup subsystems handle nested cgroups
  completely differently.  There's no consistency among subsystems and
  the behaviors often are outright broken.

  People at least seem to agree that the broken hierarhcy behaviors need
  to be weeded out if any progress is gonna be made on this front and
  that the fallouts from deprecating the broken behaviors should be
  acceptable especially given that the current behaviors don't make much
  sense when nested.

  This patch makes cgroup emit warning messages if cgroups for
  subsystems with broken hierarchy behavior are nested to prepare for
  fixing them in the future.  This was put in a separate branch because
  more related changes were expected (didn't make it this round) and the
  memory cgroup wanted to pull in this and make changes on top."

* 'for-3.7-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
2012-10-02 10:52:28 -07:00
Linus Torvalds
c0e8a139a5 Merge branch 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:

 - xattr support added.  The implementation is shared with tmpfs.  The
   usage is restricted and intended to be used to manage per-cgroup
   metadata by system software.  tmpfs changes are routed through this
   branch with Hugh's permission.

 - cgroup subsystem ID handling simplified.

* 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: Define CGROUP_SUBSYS_COUNT according the configuration
  cgroup: Assign subsystem IDs during compile time
  cgroup: Do not depend on a given order when populating the subsys array
  cgroup: Wrap subsystem selection macro
  cgroup: Remove CGROUP_BUILTIN_SUBSYS_COUNT
  cgroup: net_prio: Do not define task_netpioidx() when not selected
  cgroup: net_cls: Do not define task_cls_classid() when not selected
  cgroup: net_cls: Move sock_update_classid() declaration to cls_cgroup.h
  cgroup: trivial fixes for Documentation/cgroups/cgroups.txt
  xattr: mark variable as uninitialized to make both gcc and smatch happy
  fs: add missing documentation to simple_xattr functions
  cgroup: add documentation on extended attributes usage
  cgroup: rename subsys_bits to subsys_mask
  cgroup: add xattr support
  cgroup: revise how we re-populate root directory
  xattr: extract simple_xattr code from tmpfs
2012-10-02 10:50:47 -07:00
Linus Torvalds
033d9959ed Merge branch 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue changes from Tejun Heo:
 "This is workqueue updates for v3.7-rc1.  A lot of activities this
  round including considerable API and behavior cleanups.

   * delayed_work combines a timer and a work item.  The handling of the
     timer part has always been a bit clunky leading to confusing
     cancelation API with weird corner-case behaviors.  delayed_work is
     updated to use new IRQ safe timer and cancelation now works as
     expected.

   * Another deficiency of delayed_work was lack of the counterpart of
     mod_timer() which led to cancel+queue combinations or open-coded
     timer+work usages.  mod_delayed_work[_on]() are added.

     These two delayed_work changes make delayed_work provide interface
     and behave like timer which is executed with process context.

   * A work item could be executed concurrently on multiple CPUs, which
     is rather unintuitive and made flush_work() behavior confusing and
     half-broken under certain circumstances.  This problem doesn't
     exist for non-reentrant workqueues.  While non-reentrancy check
     isn't free, the overhead is incurred only when a work item bounces
     across different CPUs and even in simulated pathological scenario
     the overhead isn't too high.

     All workqueues are made non-reentrant.  This removes the
     distinction between flush_[delayed_]work() and
     flush_[delayed_]_work_sync().  The former is now as strong as the
     latter and the specified work item is guaranteed to have finished
     execution of any previous queueing on return.

   * In addition to the various bug fixes, Lai redid and simplified CPU
     hotplug handling significantly.

   * Joonsoo introduced system_highpri_wq and used it during CPU
     hotplug.

  There are two merge commits - one to pull in IRQ safe timer from
  tip/timers/core and the other to pull in CPU hotplug fixes from
  wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."

Fixed a number of trivial conflicts, but the more interesting conflicts
were silent ones where the deprecated interfaces had been used by new
code in the merge window, and thus didn't cause any real data conflicts.

Tejun pointed out a few of them, I fixed a couple more.

* 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
  workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
  workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
  workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
  workqueue: remove @delayed from cwq_dec_nr_in_flight()
  workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
  workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
  workqueue: use __cpuinit instead of __devinit for cpu callbacks
  workqueue: rename manager_mutex to assoc_mutex
  workqueue: WORKER_REBIND is no longer necessary for idle rebinding
  workqueue: WORKER_REBIND is no longer necessary for busy rebinding
  workqueue: reimplement idle worker rebinding
  workqueue: deprecate __cancel_delayed_work()
  workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
  workqueue: use mod_delayed_work() instead of __cancel + queue
  workqueue: use irqsafe timer for delayed_work
  workqueue: clean up delayed_work initializers and add missing one
  workqueue: make deferrable delayed_work initializer names consistent
  workqueue: cosmetic whitespace updates for macro definitions
  workqueue: deprecate system_nrt[_freezable]_wq
  workqueue: deprecate flush[_delayed]_work_sync()
  ...
2012-10-02 09:54:49 -07:00
stephen hemminger
edc7d57327 netlink: add attributes to fdb interface
Later changes need to be able to refer to neighbour attributes
when doing fdb_add.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-01 18:39:44 -04:00
Eric Dumazet
c9e6bc644e net: add gro_cells infrastructure
This adds a new include file (include/net/gro_cells.h), to bring GRO
(Generic Receive Offload) capability to tunnels, in a modular way.

Because tunnels receive path is lockless, and GRO adds a serialization
using a napi_struct, I chose to add an array of up to
DEFAULT_MAX_NUM_RSS_QUEUES cells, so that multi queue devices wont be
slowed down because of GRO layer.

skb_get_rx_queue() is used as selector.

In the future, we might add optional fanout capabilities, using rxhash
for example.

With help from Ben Hutchings who reminded me
netif_get_num_default_rss_queues() function.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-01 17:01:46 -04:00
Weiping Pan
f4b549a5ac use skb_end_offset() in skb_try_coalesce()
Commit ec47ea824774(skb: Add inline helper for getting the skb end offset from
head) introduces this helper function, skb_end_offset(),
we should make use of it.

Signed-off-by: Weiping Pan <wpan@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-01 16:43:17 -04:00
Linus Torvalds
06d2fe153b Driver core merge for 3.7-rc1
Here is the big driver core update for 3.7-rc1.
 
 A number of firmware_class.c updates (as you saw a month or so ago), and
 some hyper-v updates and some printk fixes as well.  All patches that
 are outside of the drivers/base area have been acked by the respective
 maintainers, and have all been in the linux-next tree for a while.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iEYEABECAAYFAlBp3vkACgkQMUfUDdst+ylQoACgldktGFgkCLzH+rGYthrXOC5P
 9hUAnjmOhdoHlMTL81vWTlH+BrGernym
 =khrr
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core merge from Greg Kroah-Hartman:
 "Here is the big driver core update for 3.7-rc1.

  A number of firmware_class.c updates (as you saw a month or so ago),
  and some hyper-v updates and some printk fixes as well.  All patches
  that are outside of the drivers/base area have been acked by the
  respective maintainers, and have all been in the linux-next tree for a
  while.

  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

* tag 'driver-core-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (95 commits)
  memory: tegra{20,30}-mc: Fix reading incorrect register in mc_readl()
  device.h: Add missing inline to #ifndef CONFIG_PRINTK dev_vprintk_emit
  memory: emif: Add ifdef CONFIG_DEBUG_FS guard for emif_debugfs_[init|exit]
  Documentation: Fixes some translation error in Documentation/zh_CN/gpio.txt
  Documentation: Remove 3 byte redundant code at the head of the Documentation/zh_CN/arm/booting
  Documentation: Chinese translation of Documentation/video4linux/omap3isp.txt
  device and dynamic_debug: Use dev_vprintk_emit and dev_printk_emit
  dev: Add dev_vprintk_emit and dev_printk_emit
  netdev_printk/netif_printk: Remove a superfluous logging colon
  netdev_printk/dynamic_netdev_dbg: Directly call printk_emit
  dev_dbg/dynamic_debug: Update to use printk_emit, optimize stack
  driver-core: Shut up dev_dbg_reatelimited() without DEBUG
  tools/hv: Parse /etc/os-release
  tools/hv: Check for read/write errors
  tools/hv: Fix exit() error code
  tools/hv: Fix file handle leak
  Tools: hv: Implement the KVP verb - KVP_OP_GET_IP_INFO
  Tools: hv: Rename the function kvp_get_ip_address()
  Tools: hv: Implement the KVP verb - KVP_OP_SET_IP_INFO
  Tools: hv: Add an example script to configure an interface
  ...
2012-10-01 12:10:44 -07:00
David S. Miller
6a06e5e1bb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/team/team.c
	drivers/net/usb/qmi_wwan.c
	net/batman-adv/bat_iv_ogm.c
	net/ipv4/fib_frontend.c
	net/ipv4/route.c
	net/l2tp/l2tp_netlink.c

The team, fib_frontend, route, and l2tp_netlink conflicts were simply
overlapping changes.

qmi_wwan and bat_iv_ogm were of the "use HEAD" variety.

With help from Antonio Quartulli.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-28 14:40:49 -04:00
Eric Dumazet
69b08f62e1 net: use bigger pages in __netdev_alloc_frag
We currently use percpu order-0 pages in __netdev_alloc_frag
to deliver fragments used by __netdev_alloc_skb()

Depending on NIC driver and arch being 32 or 64 bit, it allows a page to
be split in several fragments (between 1 and 8), assuming PAGE_SIZE=4096

Switching to bigger pages (32768 bytes for PAGE_SIZE=4096 case) allows :

- Better filling of space (the ending hole overhead is less an issue)

- Less calls to page allocator or accesses to page->_count

- Could allow struct skb_shared_info futures changes without major
  performance impact.

This patch implements a transparent fallback to smaller
pages in case of memory pressure.

It also uses a standard "struct page_frag" instead of a custom one.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-27 19:29:35 -04:00
Eric Dumazet
e2bcabec6e net: remove sk_init() helper
It seems sk_init() has no value today and even does strange things :

# grep . /proc/sys/net/core/?mem_*
/proc/sys/net/core/rmem_default:212992
/proc/sys/net/core/rmem_max:131071
/proc/sys/net/core/wmem_default:212992
/proc/sys/net/core/wmem_max:131071

We can remove it completely.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Shan Wei <davidshan@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-27 18:42:00 -04:00
Al Viro
cb0942b812 make get_file() return its argument
simplifies a bunch of callers...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:10:25 -04:00
Al Viro
c3c073f808 new helper: iterate_fd()
iterates through the opened files in given descriptor table,
calling a supplied function; we stop once non-zero is returned.
Callback gets struct file *, descriptor number and const void *
argument passed to iterator.  It is called with files->file_lock
held, so it is not allowed to block.

tty_io, netprio_cgroup and selinux flush_unauthorized_files()
converted to its use.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:09:59 -04:00
Eric Dumazet
3e10986d1d net: guard tcp_set_keepalive() to tcp sockets
Its possible to use RAW sockets to get a crash in
tcp_set_keepalive() / sk_reset_timer()

Fix is to make sure socket is a SOCK_STREAM one.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-24 16:51:53 -04:00
Daniel Borkmann
9e49e88958 filter: add XOR instruction for use with X/K
SKF_AD_ALU_XOR_X has been added a while ago, but as an 'ancillary'
operation that is invoked through a negative offset in K within BPF
load operations. Since BPF_MOD has recently been added, BPF_XOR should
also be part of the common ALU operations. Removing SKF_AD_ALU_XOR_X
might not be an option since this is exposed to user space.

Signed-off-by: Daniel Borkmann <daniel.borkmann@tik.ee.ethz.ch>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-24 16:49:21 -04:00
Eric Dumazet
5640f76858 net: use a per task frag allocator
We currently use a per socket order-0 page cache for tcp_sendmsg()
operations.

This page is used to build fragments for skbs.

Its done to increase probability of coalescing small write() into
single segments in skbs still in write queue (not yet sent)

But it wastes a lot of memory for applications handling many mostly
idle sockets, since each socket holds one page in sk->sk_sndmsg_page

Its also quite inefficient to build TSO 64KB packets, because we need
about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
page allocator more than wanted.

This patch adds a per task frag allocator and uses bigger pages,
if available. An automatic fallback is done in case of memory pressure.

(up to 32768 bytes per frag, thats order-3 pages on x86)

This increases TCP stream performance by 20% on loopback device,
but also benefits on other network devices, since 8x less frags are
mapped on transmit and unmapped on tx completion. Alexander Duyck
mentioned a probable performance win on systems with IOMMU enabled.

Its possible some SG enabled hardware cant cope with bigger fragments,
but their ndo_start_xmit() should already handle this, splitting a
fragment in sub fragments, since some arches have PAGE_SIZE=65536

Successfully tested on various ethernet devices.
(ixgbe, igb, bnx2x, tg3, mellanox mlx4)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-24 16:31:37 -04:00
Ed Cashin
c0d680e577 net: do not disable sg for packets requiring no checksum
A change in a series of VLAN-related changes appears to have
inadvertently disabled the use of the scatter gather feature of
network cards for transmission of non-IP ethernet protocols like ATA
over Ethernet (AoE).  Below is a reference to the commit that
introduces a "harmonize_features" function that turns off scatter
gather when the NIC does not support hardware checksumming for the
ethernet protocol of an sk buff.

  commit f01a5236bd
  Author: Jesse Gross <jesse@nicira.com>
  Date:   Sun Jan 9 06:23:31 2011 +0000

      net offloading: Generalize netif_get_vlan_features().

The can_checksum_protocol function is not equipped to consider a
protocol that does not require checksumming.  Calling it for a
protocol that requires no checksum is inappropriate.

The patch below has harmonize_features call can_checksum_protocol when
the protocol needs a checksum, so that the network layer is not forced
to perform unnecessary skb linearization on the transmission of AoE
packets.  Unnecessary linearization results in decreased performance
and increased memory pressure, as reported here:

  http://www.spinics.net/lists/linux-mm/msg15184.html

The problem has probably not been widely experienced yet, because
only recently has the kernel.org-distributed aoe driver acquired the
ability to use payloads of over a page in size, with the patchset
recently included in the mm tree:

  https://lkml.org/lkml/2012/8/28/140

The coraid.com-distributed aoe driver already could use payloads of
greater than a page in size, but its users generally do not use the
newest kernels.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-20 22:23:40 -04:00
Li RongQing
8ea853fd0b net/core: fix comment in skb_try_coalesce
It should be the skb which is not cloned

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-19 17:29:13 -04:00
Amerigo Wang
8c4c49df5c netpoll: call ->ndo_select_queue() in tx path
In netpoll tx path, we miss the chance of calling ->ndo_select_queue(),
thus could cause problems when bonding is involved.

This patch makes dev_pick_tx() extern (and rename it to netdev_pick_tx())
to let netpoll call it in netpoll_send_skb_on_dev().

Reported-by: Sylvain Munaut <s.munaut@whatever-company.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Tested-by: Sylvain Munaut <s.munaut@whatever-company.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-19 17:19:09 -04:00
stephen hemminger
6b6e27255f netdev: make address const in device address management
The internal functions for add/deleting addresses don't change
their argument.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-19 16:35:22 -04:00
Eric Dumazet
2c60db0370 net: provide a default dev->ethtool_ops
Instead of forcing device drivers to provide empty ethtool_ops or tweak
net/core/ethtool.c again, we could provide a generic ethtool_ops.

This occurred to me when I wanted to add GSO support to GRE tunnels.
ethtool -k support should be generic for all drivers.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Maciej Żenczykowski <maze@google.com>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-19 15:40:15 -04:00
Gao feng
828de4f6bf net: dev: fix incorrect getting net device's name
When moving a nic from net namespace A to net namespace B,
in dev_change_net_namesapce,we call __dev_get_by_name to
decide if the netns B has the device has the same name.

if the netns B already has the same named device,we call
dev_get_valid_name to try to get a valid name for this nic in
the netns B,but net_device->nd_net still point to netns A now.

this patch fix it.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-19 15:37:01 -04:00
Eric Dumazet
b40863c667 net: more accurate network taps in transmit path
dev_queue_xmit_nit() should be called right before ndo_start_xmit()
calls or we might give wrong packet contents to taps users :

Packet checksum can be changed, or packet can be linearized or
segmented, and segments partially sent for the later case.

Also a memory allocation can fail and packet never really hit the
driver entry point.

Reported-by: Jamie Gloudon <jamie.gloudon@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-19 15:32:42 -04:00
Michael S. Tsirkin
0e698bf662 net: fix memory leak on oom with zerocopy
If orphan flags fails, we don't free the skb
on receive, which leaks the skb memory.

Return value was also wrong: netif_receive_skb
is supposed to return NET_RX_DROP, not ENOMEM.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-18 16:24:00 -04:00
Eric W. Biederman
e1760bd5ff userns: Convert the audit loginuid to be a kuid
Always store audit loginuids in type kuid_t.

Print loginuids by converting them into uids in the appropriate user
namespace, and then printing the resulting uid.

Modify audit_get_loginuid to return a kuid_t.

Modify audit_set_loginuid to take a kuid_t.

Modify /proc/<pid>/loginuid on read to convert the loginuid into the
user namespace of the opener of the file.

Modify /proc/<pid>/loginud on write to convert the loginuid
rom the user namespace of the opener of the file.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Cc: Paul Moore <paul@paul-moore.com> ?
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-09-17 18:08:54 -07:00
Joe Perches
666f355f38 device and dynamic_debug: Use dev_vprintk_emit and dev_printk_emit
Convert direct calls of vprintk_emit and printk_emit to the
dev_ equivalents.

Make create_syslog_header static.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: David S. Miller <davem@davemloft.net>
Tested-by: Jim Cromie <jim.cromie@gmail.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-09-17 06:10:05 -07:00
Joe Perches
c2c5a7051c netdev_printk/netif_printk: Remove a superfluous logging colon
netdev_printk originally called dev_printk with %pV.

This style emitted the complete dev_printk header with
a colon followed by the netdev_name prefix followed
by a colon.

Now that netdev_printk does not call dev_printk, the
extra colon is superfluous.  Remove it.

Example:
old: sky2 0000:02:00.0: eth0: Link is up at 100 Mbps, full duplex, flow control both
new: sky2 0000:02:00.0 eth0: Link is up at 100 Mbps, full duplex, flow control both

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: David S. Miller <davem@davemloft.net>
Tested-by: Jim Cromie <jim.cromie@gmail.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-09-17 06:10:05 -07:00
Joe Perches
b004ff4972 netdev_printk/dynamic_netdev_dbg: Directly call printk_emit
A lot of stack is used in recursive printks with %pV.

Using multiple levels of %pV (a logging function with %pV
that calls another logging function with %pV) can consume
more stack than necessary.

Avoid excessive stack use by not calling dev_printk from
netdev_printk and dynamic_netdev_dbg.  Duplicate the logic
and form of dev_printk instead.

Make __netdev_printk static.
Remove EXPORT_SYMBOL(__netdev_printk)
Whitespace and brace style neatening.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: David S. Miller <davem@davemloft.net>
Tested-by: Jim Cromie <jim.cromie@gmail.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-09-17 06:08:30 -07:00
David S. Miller
b48b63a1f6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/netfilter/nfnetlink_log.c
	net/netfilter/xt_LOG.c

Rather easy conflict resolution, the 'net' tree had bug fixes to make
sure we checked if a socket is a time-wait one or not and elide the
logging code if so.

Whereas on the 'net-next' side we are calculating the UID and GID from
the creds using different interfaces due to the user namespace changes
from Eric Biederman.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-15 11:43:53 -04:00
Tejun Heo
8c7f6edbda cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
Currently, cgroup hierarchy support is a mess.  cpu related subsystems
behave correctly - configuration, accounting and control on a parent
properly cover its children.  blkio and freezer completely ignore
hierarchy and treat all cgroups as if they're directly under the root
cgroup.  Others show yet different behaviors.

These differing interpretations of cgroup hierarchy make using cgroup
confusing and it impossible to co-mount controllers into the same
hierarchy and obtain sane behavior.

Eventually, we want full hierarchy support from all subsystems and
probably a unified hierarchy.  Users using separate hierarchies
expecting completely different behaviors depending on the mounted
subsystem is deterimental to making any progress on this front.

This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
for controllers which are lacking in hierarchy support.  The goal of
this patch is two-fold.

* Move users away from using hierarchy on currently non-hierarchical
  subsystems, so that implementing proper hierarchy support on those
  doesn't surprise them.

* Keep track of which controllers are broken how and nudge the
  subsystems to implement proper hierarchy support.

For now, start with a single warning message.  We can whine louder
later on.

v2: Fixed a typo spotted by Michal. Warning message updated.

v3: Updated memcg part so that it doesn't generate warning in the
    cases where .use_hierarchy=false doesn't make the behavior
    different from root.use_hierarchy=true.  Fixed a typo spotted by
    Glauber.

v4: Check ->broken_hierarchy after cgroup creation is complete so that
    ->create() can affect the result per Michal.  Dropped unnecessary
    memcg root handling per Michal.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2012-09-14 12:01:16 -07:00
Daniel Wagner
8a8e04df47 cgroup: Assign subsystem IDs during compile time
WARNING: With this change it is impossible to load external built
controllers anymore.

In case where CONFIG_NETPRIO_CGROUP=m and CONFIG_NET_CLS_CGROUP=m is
set, corresponding subsys_id should also be a constant. Up to now,
net_prio_subsys_id and net_cls_subsys_id would be of the type int and
the value would be assigned during runtime.

By switching the macro definition IS_SUBSYS_ENABLED from IS_BUILTIN
to IS_ENABLED, all *_subsys_id will have constant value. That means we
need to remove all the code which assumes a value can be assigned to
net_prio_subsys_id and net_cls_subsys_id.

A close look is necessary on the RCU part which was introduces by
following patch:

  commit f845172531
  Author:	Herbert Xu <herbert@gondor.apana.org.au>  Mon May 24 09:12:34 2010
  Committer:	David S. Miller <davem@davemloft.net>  Mon May 24 09:12:34 2010

  cls_cgroup: Store classid in struct sock

  Tis code was added to init_cgroup_cls()

	  /* We can't use rcu_assign_pointer because this is an int. */
	  smp_wmb();
	  net_cls_subsys_id = net_cls_subsys.subsys_id;

  respectively to exit_cgroup_cls()

	  net_cls_subsys_id = -1;
	  synchronize_rcu();

  and in module version of task_cls_classid()

	  rcu_read_lock();
	  id = rcu_dereference(net_cls_subsys_id);
	  if (id >= 0)
		  classid = container_of(task_subsys_state(p, id),
					 struct cgroup_cls_state, css)->classid;
	  rcu_read_unlock();

Without an explicit explaination why the RCU part is needed. (The
rcu_deference was fixed by exchanging it to rcu_derefence_index_check()
in a later commit, but that is a minor detail.)

So here is my pondering why it was introduced and why it safe to
remove it now. Note that this code was copied over to net_prio the
reasoning holds for that subsystem too.

The idea behind the RCU use for net_cls_subsys_id is to make sure we
get a valid pointer back from task_subsys_state(). task_subsys_state()
is just blindly accessing the subsys array and returning the
pointer. Obviously, passing in -1 as id into task_subsys_state()
returns an invalid value (out of lower bound).

So this code makes sure that only after module is loaded and the
subsystem registered, the id is assigned.

Before unregistering the module all old readers must have left the
critical section. This is done by assigning -1 to the id and issuing a
synchronized_rcu(). Any new readers wont call task_subsys_state()
anymore and therefore it is safe to unregister the subsystem.

The new code relies on the same trick, but it looks at the subsys
pointer return by task_subsys_state() (remember the id is constant
and therefore we allways have a valid index into the subsys
array).

No precautions need to be taken during module loading
module. Eventually, all CPUs will get a valid pointer back from
task_subsys_state() because rebind_subsystem() which is called after
the module init() function will assigned subsys[net_cls_subsys_id] the
newly loaded module subsystem pointer.

When the subsystem is about to be removed, rebind_subsystem() will
called before the module exit() function. In this case,
rebind_subsys() will assign subsys[net_cls_subsys_id] a NULL pointer
and then it calls synchronize_rcu(). All old readers have left by then
the critical section. Any new reader wont access the subsystem
anymore.  At this point we are safe to unregister the subsystem. No
synchronize_rcu() call is needed.

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org
2012-09-14 09:57:43 -07:00
Daniel Wagner
51e4e7faba cgroup: net_prio: Do not define task_netpioidx() when not selected
task_netprioidx() should not be defined in case the configuration is
CONFIG_NETPRIO_CGROUP=n. The reason is that in a following patch the
net_prio_subsys_id will only be defined if CONFIG_NETPRIO_CGROUP!=n.
When net_prio is not built at all any callee should only get an empty
task_netprioidx() without any references to net_prio_subsys_id.

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org
2012-09-14 09:57:28 -07:00
Daniel Wagner
8fb974c937 cgroup: net_cls: Do not define task_cls_classid() when not selected
task_cls_classid() should not be defined in case the configuration is
CONFIG_NET_CLS_CGROUP=n. The reason is that in a following patch the
net_cls_subsys_id will only be defined if CONFIG_NET_CLS_CGROUP!=n.
When net_cls is not built at all a callee should only get an empty
task_cls_classid() without any references to net_cls_subsys_id.

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org
2012-09-14 09:57:25 -07:00
Nishank Trivedi
6af773e786 pktgen: fix crash with vlan and packet size less than 46
If vlan option is being specified in the pktgen and packet size
being requested is less than 46 bytes, despite being illogical
request, pktgen should not crash the kernel.

BUG: unable to handle kernel paging request at ffff88021fb82000
Process kpktgend_0 (pid: 1184, threadinfo ffff880215f1a000, task ffff880218544530)
Call Trace:
[<ffffffffa0637cd2>] ? pktgen_finalize_skb+0x222/0x300 [pktgen]
[<ffffffff814f0084>] ? build_skb+0x34/0x1c0
[<ffffffffa0639b11>] pktgen_thread_worker+0x5d1/0x1790 [pktgen]
[<ffffffffa03ffb10>] ? igb_xmit_frame_ring+0xa30/0xa30 [igb]
[<ffffffff8107ba20>] ? wake_up_bit+0x40/0x40
[<ffffffff8107ba20>] ? wake_up_bit+0x40/0x40
[<ffffffffa0639540>] ? spin+0x240/0x240 [pktgen]
[<ffffffff8107b4e3>] kthread+0x93/0xa0
[<ffffffff81615de4>] kernel_thread_helper+0x4/0x10
[<ffffffff8107b450>] ? flush_kthread_worker+0x80/0x80
[<ffffffff81615de0>] ? gs_change+0x13/0x13

The root cause of why pktgen is not able to handle this case is due
to comparison of signed (datalen) and unsigned data (sizeof), which
eventually passes a huge number to skb_put().

Signed-off-by: Nishank Trivedi <nistrive@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-13 17:10:00 -04:00
Srivatsa S. Bhat
f05ba7fccf netprio_cgroup: Use memcpy instead of the for-loop to copy priomap
Replace the current (inefficient) for-loop with memcpy, to copy priomap.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-13 16:18:40 -04:00
Srivatsa S. Bhat
d530d6df96 netprio_cgroup: Remove update_netdev_tables() since it is unnecessary
The update_netdev_tables() function appears to be unnecessary, since the
write_update_netdev_table() function will adjust the priomaps as and when
required anyway. So drop the usage of update_netdev_tables() entirely.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-13 16:18:40 -04:00
Randy Dunlap
1c463e57b3 net: fix net/core/sock.c build error
Fix net/core/sock.c build error when CONFIG_INET is not enabled:

net/built-in.o: In function `sock_edemux':
(.text+0xd396): undefined reference to `inet_twsk_put'

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-10 16:44:45 -04:00
Eric Dumazet
b6069a9570 filter: add MOD operation
Add a new ALU opcode, to compute a modulus.

Commit ffe06c17af used an ancillary to implement XOR_X,
but here we reserve one of the available ALU opcode to implement both
MOD_X and MOD_K

Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: George Bakos <gbakos@alpinista.org>
Cc: Jay Schulist <jschlst@samba.org>
Cc: Jiri Pirko <jpirko@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-10 15:44:56 -04:00
Eric W. Biederman
15e473046c netlink: Rename pid to portid to avoid confusion
It is a frequent mistake to confuse the netlink port identifier with a
process identifier.  Try to reduce this confusion by renaming fields
that hold port identifiers portid instead of pid.

I have carefully avoided changing the structures exported to
userspace to avoid changing the userspace API.

I have successfully built an allyesconfig kernel with this change.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-10 15:30:41 -04:00
Pablo Neira Ayuso
9f00d9776b netlink: hide struct module parameter in netlink_kernel_create
This patch defines netlink_kernel_create as a wrapper function of
__netlink_kernel_create to hide the struct module *me parameter
(which seems to be THIS_MODULE in all existing netlink subsystems).

Suggested by David S. Miller.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-08 18:46:30 -04:00
Pablo Neira Ayuso
9785e10aed netlink: kill netlink_set_nonroot
Replace netlink_set_nonroot by one new field `flags' in
struct netlink_kernel_cfg that is passed to netlink_kernel_create.

This patch also renames NL_NONROOT_* to NL_CFG_F_NONROOT_* since
now the flags field in nl_table is generic (so we can add more
flags if needed in the future).

Also adjust all callers in the net-next tree to use these flags
instead of netlink_set_nonroot.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-08 18:45:27 -04:00
Chema Gonzalez
6862234238 net: small bug on rxhash calculation
In the current rxhash calculation function, while the
sorting of the ports/addrs is coherent (you get the
same rxhash for packets sharing the same 4-tuple, in
both directions), ports and addrs are sorted
independently. This implies packets from a connection
between the same addresses but crossed ports hash to
the same rxhash.

For example, traffic between A=S:l and B=L:s is hashed
(in both directions) from {L, S, {s, l}}. The same
rxhash is obtained for packets between C=S:s and D=L:l.

This patch ensures that you either swap both addrs and ports,
or you swap none. Traffic between A and B, and traffic
between C and D, get their rxhash from different sources
({L, S, {l, s}} for A<->B, and {L, S, {s, l}} for C<->D)

The patch is co-written with Eric Dumazet <edumazet@google.com>

Signed-off-by: Chema Gonzalez <chema@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-08 18:41:48 -04:00
Eric W. Biederman
dbe9a4173e scm: Don't use struct ucred in NETLINK_CB and struct scm_cookie.
Passing uids and gids on NETLINK_CB from a process in one user
namespace to a process in another user namespace can result in the
wrong uid or gid being presented to userspace.  Avoid that problem by
passing kuids and kgids instead.

- define struct scm_creds for use in scm_cookie and netlink_skb_parms
  that holds uid and gid information in kuid_t and kgid_t.

- Modify scm_set_cred to fill out scm_creds by heand instead of using
  cred_to_ucred to fill out struct ucred.  This conversion ensures
  userspace does not get incorrect uid or gid values to look at.

- Modify scm_recv to convert from struct scm_creds to struct ucred
  before copying credential values to userspace.

- Modify __scm_send to populate struct scm_creds on in the scm_cookie,
  instead of just copying struct ucred from userspace.

- Modify netlink_sendmsg to copy scm_creds instead of struct ucred
  into the NETLINK_CB.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-07 14:42:05 -04:00
Nikolay Aleksandrov
c6c13965f4 net: add unknown state to sysfs NIC duplex export
Currently when the NIC duplex state is DUPLEX_UNKNOWN it is exported as
full through sysfs, this patch adds support for DUPLEX_UNKNOWN. It is
handled the same way as in ethtool.

 Signed-off-by: Nikolay Aleksandrov <naleksan@redhat.com>

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-05 17:40:07 -04:00
Eric Dumazet
e812347ccf net: sock_edemux() should take care of timewait sockets
sock_edemux() can handle either a regular socket or a timewait socket

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-03 13:22:43 -04:00
Pablo Neira Ayuso
ace1fe1231 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
This merges (3f509c6 netfilter: nf_nat_sip: fix incorrect handling
of EBUSY for RTCP expectation) to Patrick McHardy's IPv6 NAT changes.
2012-09-03 15:34:51 +02:00
Jerry Chu
8336886f78 tcp: TCP Fast Open Server - support TFO listeners
This patch builds on top of the previous patch to add the support
for TFO listeners. This includes -

1. allocating, properly initializing, and managing the per listener
fastopen_queue structure when TFO is enabled

2. changes to the inet_csk_accept code to support TFO. E.g., the
request_sock can no longer be freed upon accept(), not until 3WHS
finishes

3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg()
if it's a TFO socket

4. properly closing a TFO listener, and a TFO socket before 3WHS
finishes

5. supporting TCP_FASTOPEN socket option

6. modifying tcp_check_req() to use to check a TFO socket as well
as request_sock

7. supporting TCP's TFO cookie option

8. adding a new SYN-ACK retransmit handler to use the timer directly
off the TFO socket rather than the listener socket. Note that TFO
server side will not retransmit anything other than SYN-ACK until
the 3WHS is completed.

The patch also contains an important function
"reqsk_fastopen_remove()" to manage the somewhat complex relation
between a listener, its request_sock, and the corresponding child
socket. See the comment above the function for the detail.

Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-31 20:02:19 -04:00
Rami Rosen
d1a53dfd11 net: fix documentation of skb_needs_linearize().
skb_needs_linearize() does not check highmem DMA as it does not call
illegal_highdma() anymore, so there is no need to mention highmem DMA here.

(Indeed, ~NETIF_F_SG flag, which is checked in skb_needs_linearize(), can
be set when illegal_highdma() returns true, and we are assured that
illegal_highdma() is invoked prior to skb_needs_linearize() as
skb_needs_linearize() is a static method called only once.
But ~NETIF_F_SG can be set not only there in this same invocation path.
It can also be set when can_checksum_protocol() returns false).

see commit 02932ce9e2,
Convert skb_need_linearize() to use precomputed features.
Signed-off-by: Rami Rosen <rosenr@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-31 16:24:02 -04:00
David S. Miller
c32f38619a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Merge the 'net' tree to get the recent set of netfilter bug fixes in
order to assist with some merge hassles Pablo is going to have to deal
with for upcoming changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-31 15:14:18 -04:00
Eric Dumazet
ee13040901 netpoll: provide an IP ident in UDP frames
Let's fill IP header ident field with a meaningful value,
it might help some setups.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-30 12:32:14 -04:00
Gao feng
6549dd43c0 net: dev: fix the incorrect hold of net namespace's lo device
When moving a net device from one net namespace to another
net namespace,dev_change_net_namespace calls NETDEV_DOWN
event,so the original net namespace's dst entries which
beloned to this net device will be put into dst_garbage
list.

then dev_change_net_namespace will set this net device's
net to the new net namespace.

If we unregister this net device's driver, this will trigger
the NETDEV_UNREGISTER_FINAL event, dst_ifdown will be called,
and get this net device's dst entries from dst_garbage list,
put these entries' dev to the new net namespace's lo device.

It's not what we want,actually we need these dst entries hold
the original net namespace's lo device,this incorrect device
holding will trigger emg message like below.
unregister_netdevice: waiting for lo to become free. Usage count = 1

so we should call NETDEV_UNREGISTER_FINAL event in
dev_change_net_namespace too,in order to make sure dst entries
already in the dst_garbage list, we need rcu_barrier before we
call NETDEV_UNREGISTER_FINAL event.

With help form Eric Dumazet.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-30 12:21:16 -04:00
Patrick McHardy
58a317f106 netfilter: ipv6: add IPv6 NAT support
Signed-off-by: Patrick McHardy <kaber@trash.net>
2012-08-30 03:00:17 +02:00
Patrick McHardy
2cf545e835 net: core: add function for incremental IPv6 pseudo header checksum updates
Add inet_proto_csum_replace16 for incrementally updating IPv6 pseudo header
checksums for IPv6 NAT.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: David S. Miller <davem@davemloft.net>
2012-08-30 03:00:16 +02:00
Amerigo Wang
072a9c4860 netpoll: revert 6bdb7fe310 and fix be_poll() instead
Against -net.

In the patch "netpoll: re-enable irq in poll_napi()", I tried to
fix the following warning:

[100718.051041] ------------[ cut here ]------------
[100718.051048] WARNING: at kernel/softirq.c:159 local_bh_enable_ip+0x7d/0xb0()
(Not tainted)
[100718.051049] Hardware name: ProLiant BL460c G7
...
[100718.051068] Call Trace:
[100718.051073]  [<ffffffff8106b747>] ? warn_slowpath_common+0x87/0xc0
[100718.051075]  [<ffffffff8106b79a>] ? warn_slowpath_null+0x1a/0x20
[100718.051077]  [<ffffffff810747ed>] ? local_bh_enable_ip+0x7d/0xb0
[100718.051080]  [<ffffffff8150041b>] ? _spin_unlock_bh+0x1b/0x20
[100718.051085]  [<ffffffffa00ee974>] ? be_process_mcc+0x74/0x230 [be2net]
[100718.051088]  [<ffffffffa00ea68c>] ? be_poll_tx_mcc+0x16c/0x290 [be2net]
[100718.051090]  [<ffffffff8144fe76>] ? netpoll_poll_dev+0xd6/0x490
[100718.051095]  [<ffffffffa01d24a5>] ? bond_poll_controller+0x75/0x80 [bonding]
[100718.051097]  [<ffffffff8144fde5>] ? netpoll_poll_dev+0x45/0x490
[100718.051100]  [<ffffffff81161b19>] ? ksize+0x19/0x80
[100718.051102]  [<ffffffff81450437>] ? netpoll_send_skb_on_dev+0x157/0x240

by reenabling IRQ before calling ->poll, but it seems more
problems are introduced after that patch:

http://ozlabs.org/~akpm/stuff/IMG_20120824_122054.jpg
http://marc.info/?l=linux-netdev&m=134563282530588&w=2

So it is safe to fix be2net driver code directly.

This patch reverts the offending commit and fixes be_poll() by
avoid disabling BH there, this is okay because be_poll()
can be called either by poll_napi() which already disables
IRQ, or by net_rx_action() which already disables BH.

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Sylvain Munaut <s.munaut@whatever-company.com>
Cc: Sylvain Munaut <s.munaut@whatever-company.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Miller <davem@davemloft.net>
Cc: Sathya Perla <sathya.perla@emulex.com>
Cc: Subbu Seetharaman <subbu.seetharaman@emulex.com>
Cc: Ajit Khaparde <ajit.khaparde@emulex.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Tested-by: Sylvain Munaut <s.munaut@whatever-company.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-29 15:03:23 -04:00
David S. Miller
e6acb38480 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
This is an initial merge in of Eric Biederman's work to start adding
user namespace support to the networking.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-24 18:54:37 -04:00
Ben Hutchings
8f4cccbbd9 net: Set device operstate at registration time
The operstate of a device is initially IF_OPER_UNKNOWN and is updated
asynchronously by linkwatch after each change of carrier state
reported by the driver.  The default carrier state of a net device is
on, and this will never be changed on drivers that do not support
carrier detection, thus the operstate remains IF_OPER_UNKNOWN.

For devices that do support carrier detection, the driver must set the
carrier state to off initially, then poll the hardware state when the
device is opened.  However, we must not activate linkwatch for a
unregistered device, and commit b473001 ('net: Do not fire linkwatch
events until the device is registered.') ensured that we don't.  But
this means that the operstate for many devices that support carrier
detection remains IF_OPER_UNKNOWN when it should be IF_OPER_DOWN.

The same issue exists with the dormant state.

The proper initialisation sequence, avoiding a race with opening of
the device, is:

        rtnl_lock();
        rc = register_netdevice(dev);
        if (rc)
                goto out_unlock;
        netif_carrier_off(dev); /* or netif_dormant_on(dev) */
        rtnl_unlock();

but it seems silly that this should have to be repeated in so many
drivers.  Further, the operstate seen immediately after opening the
device may still be IF_OPER_UNKNOWN due to the asynchronous nature of
linkwatch.

Commit 22604c8 ('net: Fix for initial link state in 2.6.28') attempted
to fix this by setting the operstate synchronously, but it was
reverted as it could lead to deadlock.

This initialises the operstate synchronously at registration time
only.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-24 12:46:13 -04:00
Neil Horman
3afa6d00fb cls_cgroup: Allow classifier cgroups to have their classid reset to 0
The network classifier cgroup initalizes each cgroups instance classid value to
0.  However, the sock_update_classid function only updates classid's in sockets
if the tasks cgroup classid is not zero, and if it differs from the current
classid.  The later check is to prevent cache line dirtying, but the former is
detrimental, as it prevents resetting a classid for a cgroup to 0.  While this
is not a common action, it has administrative usefulness (if the admin wants to
disable classification of a certain group temporarily for instance).

Easy fix, just remove the zero check.  Tested successfully by myself

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-24 12:41:17 -04:00
Eric Dumazet
748e2d9396 net: reinstate rtnl in call_netdevice_notifiers()
Eric Biederman pointed out that not holding RTNL while calling
call_netdevice_notifiers() was racy.

This patch is a direct transcription his feedback
against commit 0115e8e30d (net: remove delay at device dismantle)

Thanks Eric !

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-23 09:24:42 -07:00
Eric Dumazet
0115e8e30d net: remove delay at device dismantle
I noticed extra one second delay in device dismantle, tracked down to
a call to dst_dev_event() while some call_rcu() are still in RCU queues.

These call_rcu() were posted by rt_free(struct rtable *rt) calls.

We then wait a little (but one second) in netdev_wait_allrefs() before
kicking again NETDEV_UNREGISTER.

As the call_rcu() are now completed, dst_dev_event() can do the needed
device swap on busy dst.

To solve this problem, add a new NETDEV_UNREGISTER_FINAL, called
after a rcu_barrier(), but outside of RTNL lock.

Use NETDEV_UNREGISTER_FINAL with care !

Change dst_dev_event() handler to react to NETDEV_UNREGISTER_FINAL

Also remove NETDEV_UNREGISTER_BATCH, as its not used anymore after
IP cache removal.

With help from Gao feng

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-22 21:50:36 -07:00
David S. Miller
1304a7343b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-08-22 14:21:38 -07:00
Tejun Heo
e7c2f96744 workqueue: use mod_delayed_work() instead of __cancel + queue
Now that mod_delayed_work() is safe to call from IRQ handlers,
__cancel_delayed_work() followed by queue_delayed_work() can be
replaced with mod_delayed_work().

Most conversions are straight-forward except for the following.

* net/core/link_watch.c: linkwatch_schedule_work() was doing a quite
  elaborate dancing around its delayed_work.  Collapse it such that
  linkwatch_work is queued for immediate execution if LW_URGENT and
  existing timer is kept otherwise.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
2012-08-21 13:18:24 -07:00
Tejun Heo
203b42f731 workqueue: make deferrable delayed_work initializer names consistent
Initalizers for deferrable delayed_work are confused.

* __DEFERRED_WORK_INITIALIZER()
* DECLARE_DEFERRED_WORK()
* INIT_DELAYED_WORK_DEFERRABLE()

Rename them to

* __DEFERRABLE_WORK_INITIALIZER()
* DECLARE_DEFERRABLE_WORK()
* INIT_DEFERRABLE_WORK()

This patch doesn't cause any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-21 13:18:23 -07:00
Randy Dunlap
3de7a37b02 net/core/dev.c: fix kernel-doc warning
Fix kernel-doc warning:

Warning(net/core/dev.c:5745): No description found for parameter 'dev'

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc:	"David S. Miller" <davem@davemloft.net>
Cc:	netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-20 03:00:55 -07:00
Eric Leblond
c0de08d042 af_packet: don't emit packet on orig fanout group
If a packet is emitted on one socket in one group of fanout sockets,
it is transmitted again. It is thus read again on one of the sockets
of the fanout group. This result in a loop for software which
generate packets when receiving one.
This retransmission is not the intended behavior: a fanout group
must behave like a single socket. The packet should not be
transmitted on a socket if it originates from a socket belonging
to the same fanout group.

This patch fixes the issue by changing the transmission check to
take fanout group info account.

Reported-by: Aleksandr Kotov <a1k@mail.ru>
Signed-off-by: Eric Leblond <eric@regit.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-20 02:37:29 -07:00
John Fastabend
476ad154f3 net: netprio: fix cgrp create and write priomap race
A race exists where creating cgroups and also updating the priomap
may result in losing a priomap update. This is because priomap
writers are not protected by rtnl_lock.

Move priority writer into rtnl_lock()/rtnl_unlock().

CC: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-16 14:56:11 -07:00
John Fastabend
48a87cc26c net: netprio: fd passed in SCM_RIGHTS datagram not set correctly
A socket fd passed in a SCM_RIGHTS datagram was not getting
updated with the new tasks cgrp prioidx. This leaves IO on
the socket tagged with the old tasks priority.

To fix this add a check in the scm recvmsg path to update the
sock cgrp prioidx with the new tasks value.

Thanks to Al Viro for catching this.

CC: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-16 14:56:11 -07:00
John Fastabend
f796c20cf6 net: netprio: fix files lock and remove useless d_path bits
Add lock to prevent a race with a file closing and also remove
useless and ugly sscanf code. The extra code was never needed
and the case it supposedly protected against is in fact handled
correctly by sock_from_file as pointed out by Al Viro.

CC: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-16 14:56:11 -07:00
Eric W. Biederman
976d020150 userns: Convert sock_i_uid to return a kuid_t
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-08-14 21:47:34 -07:00
Eric W. Biederman
d04a48b06d userns: Convert __dev_set_promiscuity to use kuids in audit logs
Cc: Klaus Heinrich Kiwi <klausk@br.ibm.com>
Cc: Eric Paris <eparis@redhat.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-08-14 21:42:40 -07:00
Eric W. Biederman
b2e4f544fd userns: Convert net/core/scm.c to use kuids and kgids
With the existence of kuid_t and kgid_t we can take this further
and remove the usage of struct cred altogether, ensuring we
don't get cache line misses from reference counts.   For now
however start simply and do a straight forward conversion
I can be certain is correct.

In cred_to_ucred use from_kuid_munged and from_kgid_munged
as these values are going directly to userspace and we want to use
the userspace safe values not -1 when reporting a value that does not
map.  The earlier conversion that used from_kuid was buggy in that
respect.  Oops.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-08-14 21:41:58 -07:00
Amerigo Wang
6bdb7fe310 netpoll: re-enable irq in poll_napi()
napi->poll() needs IRQ enabled, so we have to re-enable IRQ before
calling it.

Cc: David Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-14 14:33:33 -07:00
Amerigo Wang
689971b446 netpoll: handle vlan tags in netpoll tx and rx path
Without this patch, I can't get netconsole logs remotely over
vlan. The reason is probably we don't handle vlan tags in either
netpoll tx or rx path.

I am not sure if I use these vlan functions correctly, at
least this patch works.

Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-14 14:33:32 -07:00
Amerigo Wang
2899656b49 netpoll: take rcu_read_lock_bh() in netpoll_send_skb_on_dev()
This patch fixes several problems in the call path of
netpoll_send_skb_on_dev():

1. Disable IRQ's before calling netpoll_send_skb_on_dev().

2. All the callees of netpoll_send_skb_on_dev() should use
   rcu_dereference_bh() to dereference ->npinfo.

3. Rename arp_reply() to netpoll_arp_reply(), the former is too generic.

Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-14 14:33:31 -07:00
Amerigo Wang
57c5d46191 netpoll: take rcu_read_lock_bh() in netpoll_rx()
In __netpoll_rx(), it dereferences ->npinfo without rcu_dereference_bh(),
this patch fixes it by using the 'npinfo' passed from netpoll_rx()
where it is already dereferenced with rcu_dereference_bh().

Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-14 14:33:31 -07:00
Amerigo Wang
38e6bc185d netpoll: make __netpoll_cleanup non-block
Like the previous patch, slave_disable_netpoll() and __netpoll_cleanup()
may be called with read_lock() held too, so we should make them
non-block, by moving the cleanup and kfree() to call_rcu_bh() callbacks.

Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-14 14:33:30 -07:00
Amerigo Wang
47be03a28c netpoll: use GFP_ATOMIC in slave_enable_netpoll() and __netpoll_setup()
slave_enable_netpoll() and __netpoll_setup() may be called
with read_lock() held, so should use GFP_ATOMIC to allocate
memory. Eric suggested to pass gfp flags to __netpoll_setup().

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-14 14:33:30 -07:00
Amerigo Wang
b7bc2a5b5b net: remove netdev_bonding_change()
I don't see any benifits to use netdev_bonding_change() than
using call_netdevice_notifiers() directly.

Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-14 14:28:24 -07:00
Amerigo Wang
ee89bab14e net: move and rename netif_notify_peers()
I believe net/core/dev.c is a better place for netif_notify_peers(),
because other net event notify functions also stay in this file.

And rename it to netdev_notify_peers().

Cc: David S. Miller <davem@davemloft.net>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-14 14:28:23 -07:00
Tejun Heo
41f63c5359 workqueue: use mod_delayed_work() instead of cancel + queue
Convert delayed_work users doing cancel_delayed_work() followed by
queue_delayed_work() to mod_delayed_work().

Most conversions are straight-forward.  Ones worth mentioning are,

* drivers/edac/edac_mc.c: edac_mc_workq_setup() converted to always
  use mod_delayed_work() and cancel loop in
  edac_mc_reset_delay_period() is dropped.

* drivers/platform/x86/thinkpad_acpi.c: No need to remember whether
  watchdog is active or not.  @fan_watchdog_active and related code
  dropped.

* drivers/power/charger-manager.c: Seemingly a lot of
  delayed_work_pending() abuse going on here.
  [delayed_]work_pending() are unsynchronized and racy when used like
  this.  I converted one instance in fullbatt_handler().  Please
  conver the rest so that it invokes workqueue APIs for the intended
  target state rather than trying to game work item pending state
  transitions.  e.g. if timer should be modified - call
  mod_delayed_work(), canceled - call cancel_delayed_work[_sync]().

* drivers/thermal/thermal_sys.c: thermal_zone_device_set_polling()
  simplified.  Note that round_jiffies() calls in this function are
  meaningless.  round_jiffies() work on absolute jiffies not delta
  delay used by delayed_work.

v2: Tomi pointed out that __cancel_delayed_work() users can't be
    safely converted to mod_delayed_work().  They could be calling it
    from irq context and if that happens while delayed_work_timer_fn()
    is running, it could deadlock.  __cancel_delayed_work() users are
    dropped.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Acked-by: Anton Vorontsov <cbouatmailru@gmail.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Doug Thompson <dougthompson@xmission.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Roland Dreier <roland@kernel.org>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Len Brown <len.brown@intel.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
2012-08-13 16:27:37 -07:00
Pavel Emelyanov
aa79e66eee net: Make ifindex generation per-net namespace
Strictly speaking this is only _really_ required for checkpoint-restore to
make loopback device always have the same index.

This change appears to be safe wrt "ifindex should be unique per-system"
concept, as all the ifindex usage is either already made per net namespace
of is explicitly limited with init_net only.

There are two cool side effects of this. The first one -- ifindices of
devices in container are always small, regardless of how many containers
we've started (and re-started) so far. The second one is -- we can speed
up the loopback ifidex access as shown in the next patch.

v2: Place ifindex right after dev_base_seq : avoid two holes and use the
    same cache line, dirtied in list_netdevice()/unlist_netdevice()

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-09 16:18:07 -07:00
Pavel Emelyanov
9c7dafbfab net: Allow to create links with given ifindex
Currently the RTM_NEWLINK results in -EOPNOTSUPP if the ifinfomsg->ifi_index
is not zero. I propose to allow requesting ifindices on link creation. This
is required by the checkpoint-restore to correctly restore a net namespace
(i.e. -- a container).

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-09 16:18:06 -07:00
Eric Dumazet
a399a80531 time: jiffies_delta_to_clock_t() helper to the rescue
Various /proc/net files sometimes report crazy timer values, expressed
in clock_t units.

This happens when an expired timer delta (expires - jiffies) is passed
to jiffies_to_clock_t().

This function has an overflow in :

return div_u64((u64)x * TICK_NSEC, NSEC_PER_SEC / USER_HZ);

commit cbbc719fcc (time: Change jiffies_to_clock_t() argument type
to unsigned long) only got around the problem.

As we cant output negative values in /proc/net/tcp without breaking
various tools, I suggest adding a jiffies_delta_to_clock_t() wrapper
that caps the negative delta to a 0 value.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Maciej Żenczykowski <maze@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: hank <pyu@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-09 16:17:03 -07:00
Alexey Khoroshilov
7364e445f6 net/core: Fix potential memory leak in dev_set_alias()
Do not leak memory by updating pointer with potentially NULL realloc return value.

Found by Linux Driver Verification project (linuxtesting.org).

Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-08 16:06:23 -07:00
Eric Dumazet
a37e6e3449 net: force dst_default_metrics to const section
While investigating on network performance problems, I found this little
gem :

$ nm -v vmlinux | grep -1 dst_default_metrics
ffffffff82736540 b busy.46605
ffffffff82736560 B dst_default_metrics
ffffffff82736598 b dst_busy_list

Apparently, declaring a const array without initializer put it in
(writeable) bss section, in middle of possibly often dirtied cache
lines.

Since we really want dst_default_metrics be const to avoid any possible
false sharing and catch any buggy writes, I force a null initializer.

ffffffff818a4c20 R dst_default_metrics

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-08 16:00:28 -07:00
Ben Hutchings
1485348d24 tcp: Apply device TSO segment limit earlier
Cache the device gso_max_segs in sock::sk_gso_max_segs and use it to
limit the size of TSO skbs.  This avoids the need to fall back to
software GSO for local TCP senders.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-02 00:19:17 -07:00
Ben Hutchings
30b678d844 net: Allow driver to limit number of GSO segments per skb
A peer (or local user) may cause TCP to use a nominal MSS of as little
as 88 (actual MSS of 76 with timestamps).  Given that we have a
sufficiently prodigious local sender and the peer ACKs quickly enough,
it is nevertheless possible to grow the window for such a connection
to the point that we will try to send just under 64K at once.  This
results in a single skb that expands to 861 segments.

In some drivers with TSO support, such an skb will require hundreds of
DMA descriptors; a substantial fraction of a TX ring or even more than
a full ring.  The TX queue selected for the skb may stall and trigger
the TX watchdog repeatedly (since the problem skb will be retried
after the TX reset).  This particularly affects sfc, for which the
issue is designated as CVE-2012-3412.

Therefore:
1. Add the field net_device::gso_max_segs holding the device-specific
   limit.
2. In netif_skb_features(), if the number of segments is too high then
   mask out GSO features to force fall back to software GSO.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-02 00:19:17 -07:00
Linus Torvalds
ac694dbdbc Merge branch 'akpm' (Andrew's patch-bomb)
Merge Andrew's second set of patches:
 - MM
 - a few random fixes
 - a couple of RTC leftovers

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (120 commits)
  rtc/rtc-88pm80x: remove unneed devm_kfree
  rtc/rtc-88pm80x: assign ret only when rtc_register_driver fails
  mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
  tmpfs: distribute interleave better across nodes
  mm: remove redundant initialization
  mm: warn if pg_data_t isn't initialized with zero
  mips: zero out pg_data_t when it's allocated
  memcg: gix memory accounting scalability in shrink_page_list
  mm/sparse: remove index_init_lock
  mm/sparse: more checks on mem_section number
  mm/sparse: optimize sparse_index_alloc
  memcg: add mem_cgroup_from_css() helper
  memcg: further prevent OOM with too many dirty pages
  memcg: prevent OOM with too many dirty pages
  mm: mmu_notifier: fix freed page still mapped in secondary MMU
  mm: memcg: only check anon swapin page charges for swap cache
  mm: memcg: only check swap cache pages for repeated charging
  mm: memcg: split swapin charge function into private and public part
  mm: memcg: remove needless !mm fixup to init_mm when charging
  mm: memcg: remove unneeded shmem charge type
  ...
2012-07-31 19:25:39 -07:00
Linus Torvalds
3e9a97082f This patch series contains a major revamp of how we collect entropy
from interrupts for /dev/random and /dev/urandom.  The goal is to
 addresses weaknesses discussed in the paper "Mining your Ps and Qs:
 Detection of Widespread Weak Keys in Network Devices", by Nadia
 Heninger, Zakir Durumeric, Eric Wustrow, J. Alex Halderman, which will
 be published in the Proceedings of the 21st Usenix Security Symposium,
 August 2012.  (See https://factorable.net for more information and an
 extended version of the paper.)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJQF/0DAAoJENNvdpvBGATwIowQAOep9QKtLrBvb2lwIRVmeiy8
 lRf7V/tYZnz4FePbR0W92JQfKYkCV8yyOO0bmeRzWL3v4m+lRwDTSyA1DDyQMoH+
 LOMzvDKSLJMSXTXdSOIr1WYACphViCR/9CrbMBCKSkYfZLJ1MdaEDxT3rcpTGD0T
 6iknUweiSkHHhkerU5yQL7FKzD5kYUe0hsF47w7QVlHRHJsW2fsZqkFoh+RpnhNw
 03u+djxNGBo9qV81vZ9D1b0vA9uRlEjoWOOEG2XE4M2iq6TUySueA72dQnCwunfi
 3kG/u1Swv2dgq6aRrP3H7zdwhYSourGxziu3jNhEKwKEohrxYY7xjNX3RVeTqP67
 AzlKsOTWpRLIDrzjSLlb8VxRQiZewu8Unex3e1G+eo20sbcIObHGrxNp7K00zZvd
 QZiMHhOwItwFTe4lBO+XbqH2JKbL9/uJmwh5EipMpQTraKO9E6N3CJiUHjzBLo2K
 iGDZxRMKf4gVJRwDxbbP6D70JPVu8ZJ09XVIpsXQ3Z1xNqaMF0QdCmP3ty56q1o0
 NvkSXxPKrijZs8Sk0rVDqnJ3ll8PuDnXMv5eDtL42VT818I5WxESn9djjwEanGv0
 TYxbFub/NRxmPEE5B2Js5FBpqsLf5f282OSMeS/5WLBbnHJR1OoPoAhGVpHvxntC
 bi5FC1OolqhvzVIdsqgt
 =u7KM
 -----END PGP SIGNATURE-----

Merge tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random

Pull random subsystem patches from Ted Ts'o:
 "This patch series contains a major revamp of how we collect entropy
  from interrupts for /dev/random and /dev/urandom.

  The goal is to addresses weaknesses discussed in the paper "Mining
  your Ps and Qs: Detection of Widespread Weak Keys in Network Devices",
  by Nadia Heninger, Zakir Durumeric, Eric Wustrow, J.  Alex Halderman,
  which will be published in the Proceedings of the 21st Usenix Security
  Symposium, August 2012.  (See https://factorable.net for more
  information and an extended version of the paper.)"

Fix up trivial conflicts due to nearby changes in
drivers/{mfd/ab3100-core.c, usb/gadget/omap_udc.c}

* tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random: (33 commits)
  random: mix in architectural randomness in extract_buf()
  dmi: Feed DMI table to /dev/random driver
  random: Add comment to random_initialize()
  random: final removal of IRQF_SAMPLE_RANDOM
  um: remove IRQF_SAMPLE_RANDOM which is now a no-op
  sparc/ldc: remove IRQF_SAMPLE_RANDOM which is now a no-op
  [ARM] pxa: remove IRQF_SAMPLE_RANDOM which is now a no-op
  board-palmz71: remove IRQF_SAMPLE_RANDOM which is now a no-op
  isp1301_omap: remove IRQF_SAMPLE_RANDOM which is now a no-op
  pxa25x_udc: remove IRQF_SAMPLE_RANDOM which is now a no-op
  omap_udc: remove IRQF_SAMPLE_RANDOM which is now a no-op
  goku_udc: remove IRQF_SAMPLE_RANDOM which was commented out
  uartlite: remove IRQF_SAMPLE_RANDOM which is now a no-op
  drivers: hv: remove IRQF_SAMPLE_RANDOM which is now a no-op
  xen-blkfront: remove IRQF_SAMPLE_RANDOM which is now a no-op
  n2_crypto: remove IRQF_SAMPLE_RANDOM which is now a no-op
  pda_power: remove IRQF_SAMPLE_RANDOM which is now a no-op
  i2c-pmcmsp: remove IRQF_SAMPLE_RANDOM which is now a no-op
  input/serio/hp_sdc.c: remove IRQF_SAMPLE_RANDOM which is now a no-op
  mfd: remove IRQF_SAMPLE_RANDOM which is now a no-op
  ...
2012-07-31 19:07:42 -07:00
Mel Gorman
c76562b670 netvm: prevent a stream-specific deadlock
This patch series is based on top of "Swap-over-NBD without deadlocking
v15" as it depends on the same reservation of PF_MEMALLOC reserves logic.

When a user or administrator requires swap for their application, they
create a swap partition and file, format it with mkswap and activate it
with swapon.  In diskless systems this is not an option so if swap if
required then swapping over the network is considered.  The two likely
scenarios are when blade servers are used as part of a cluster where the
form factor or maintenance costs do not allow the use of disks and thin
clients.

The Linux Terminal Server Project recommends the use of the Network Block
Device (NBD) for swap but this is not always an option.  There is no
guarantee that the network attached storage (NAS) device is running Linux
or supports NBD.  However, it is likely that it supports NFS so there are
users that want support for swapping over NFS despite any performance
concern.  Some distributions currently carry patches that support swapping
over NFS but it would be preferable to support it in the mainline kernel.

Patch 1 avoids a stream-specific deadlock that potentially affects TCP.

Patch 2 is a small modification to SELinux to avoid using PFMEMALLOC
	reserves.

Patch 3 adds three helpers for filesystems to handle swap cache pages.
	For example, page_file_mapping() returns page->mapping for
	file-backed pages and the address_space of the underlying
	swap file for swap cache pages.

Patch 4 adds two address_space_operations to allow a filesystem
	to pin all metadata relevant to a swapfile in memory. Upon
	successful activation, the swapfile is marked SWP_FILE and
	the address space operation ->direct_IO is used for writing
	and ->readpage for reading in swap pages.

Patch 5 notes that patch 3 is bolting
	filesystem-specific-swapfile-support onto the side and that
	the default handlers have different information to what
	is available to the filesystem. This patch refactors the
	code so that there are generic handlers for each of the new
	address_space operations.

Patch 6 adds an API to allow a vector of kernel addresses to be
	translated to struct pages and pinned for IO.

Patch 7 adds support for using highmem pages for swap by kmapping
	the pages before calling the direct_IO handler.

Patch 8 updates NFS to use the helpers from patch 3 where necessary.

Patch 9 avoids setting PF_private on PG_swapcache pages within NFS.

Patch 10 implements the new swapfile-related address_space operations
	for NFS and teaches the direct IO handler how to manage
	kernel addresses.

Patch 11 prevents page allocator recursions in NFS by using GFP_NOIO
	where appropriate.

Patch 12 fixes a NULL pointer dereference that occurs when using
	swap-over-NFS.

With the patches applied, it is possible to mount a swapfile that is on an
NFS filesystem.  Swap performance is not great with a swap stress test
taking roughly twice as long to complete than if the swap device was
backed by NBD.

This patch: netvm: prevent a stream-specific deadlock

It could happen that all !SOCK_MEMALLOC sockets have buffered so much data
that we're over the global rmem limit.  This will prevent SOCK_MEMALLOC
buffers from receiving data, which will prevent userspace from running,
which is needed to reduce the buffered data.

Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit.  Once
this change it applied, it is important that sockets that set
SOCK_MEMALLOC do not clear the flag until the socket is being torn down.
If this happens, a warning is generated and the tokens reclaimed to avoid
accounting errors until the bug is fixed.

[davem@davemloft.net: Warning about clearing SOCK_MEMALLOC]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 18:42:47 -07:00
Mel Gorman
b4b9e35585 netvm: set PF_MEMALLOC as appropriate during SKB processing
In order to make sure pfmemalloc packets receive all memory needed to
proceed, ensure processing of pfmemalloc SKBs happens under PF_MEMALLOC.
This is limited to a subset of protocols that are expected to be used for
writing to swap.  Taps are not allowed to use PF_MEMALLOC as these are
expected to communicate with userspace processes which could be paged out.

[a.p.zijlstra@chello.nl: Ideas taken from various patches]
[jslaby@suse.cz: Lock imbalance fix]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 18:42:46 -07:00
Mel Gorman
c93bdd0e03 netvm: allow skb allocation to use PFMEMALLOC reserves
Change the skb allocation API to indicate RX usage and use this to fall
back to the PFMEMALLOC reserve when needed.  SKBs allocated from the
reserve are tagged in skb->pfmemalloc.  If an SKB is allocated from the
reserve and the socket is later found to be unrelated to page reclaim, the
packet is dropped so that the memory remains available for page reclaim.
Network protocols are expected to recover from this packet loss.

[a.p.zijlstra@chello.nl: Ideas taken from various patches]
[davem@davemloft.net: Use static branches, coding style corrections]
[sebastian@breakpoint.cc: Avoid unnecessary cast, fix !CONFIG_NET build]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 18:42:46 -07:00