Whoops, I missed removing the kerneldoc comment for the lrucare arg
that was removed from mem_cgroup_replace_page; but it's a good
comment, so keep it.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 42cb14b110 ("mm: migrate dirty page without
clear_page_dirty_for_io etc") simplified the migration of a PageDirty
pagecache page: one stat needs moving from zone to zone and that's about
all.
It's convenient and safest for it to shift the PageDirty bit from old
page to new, just before updating the zone stats: before copying data
and marking the new PageUptodate. This is all done while both pages are
isolated and locked, just as before; and just as before, there's a
moment when the new page is visible in the radix_tree, but not yet
PageUptodate. What's new is that it may now be briefly visible as
PageDirty before it is PageUptodate.
When I scoured the tree to see if this could cause a problem
anywhere, the only places I found were the two similar
__r4w_get_page() functions, which look up a page with find_get_page()
(not using the page lock), then claim it's uptodate if it's PageDirty
or PageWriteback or PageUptodate.
I'm not sure whether that was right before, but now it might be wrong
(on rare occasions): only claim the page is uptodate if PageUptodate.
Or perhaps the page in question could never be migratable anyway?
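A minimal sketch of the safer check, assuming the find_get_page()-based
lookup described above (surrounding context abbreviated):

	page = find_get_page(mapping, index);
	if (page) {
		/* Without the page lock, only PageUptodate proves the data
		 * is valid; PageDirty or PageWriteback no longer imply it
		 * while migration is in progress. */
		*uptodate = PageUptodate(page);
		return page;
	}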
Signed-off-by: Hugh Dickins <hughd@google.com>
Tested-by: Boaz Harrosh <ooo@electrozaur.com>
Cc: Benny Halevy <bhalevy@panasas.com>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vladimir architected and authored much of the current state of the
memcg's slab memory accounting and tracking. Make sure he gets CC'd on
bug reports ;-)
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tetsuo Handa has reported that the system might basically livelock in
an OOM condition without triggering the OOM killer.
The issue is caused by an internal dependency of the direct reclaim on
vmstat counter updates (via zone_reclaimable), which are performed
from the workqueue context. If all the current workers get assigned to
an allocation request, though, they will be looping inside the
allocator trying to reclaim memory, but zone_reclaimable can see
stalled numbers, so it will consider a zone reclaimable even though it
has been scanned way too much. The WQ concurrency logic will not
consider this situation a congested workqueue, because it relies on
workers having to sleep in such a situation. This also means that it
doesn't try to spawn new workers or invoke the rescuer thread if one
is assigned to the queue.
In order to fix this issue we need to do two things. First, we have
to let the wq concurrency code know that we are in trouble, so we have
to do a short sleep. In order to avoid reintroducing the issues
handled by commit 0e093d9976 ("writeback: do not sleep on the
congestion queue if there are no congested BDIs or if significant
congestion is not being encountered in the current zone"), we limit
the sleep to worker threads, which are the ones of interest anyway.
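A sketch of that first part, assuming the wait_iff_congested() path
(the exact location is an assumption):

	/* WQ concurrency control only notices a stuck queue if the worker
	 * actually sleeps, so a plain cond_resched() is not enough here */
	if (current->flags & PF_WQ_WORKER)
		schedule_timeout_uninterruptible(1);
	else
		cond_resched();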
The second thing to do is to create a dedicated workqueue for vmstat and
mark it WQ_MEM_RECLAIM to note it participates in the reclaim and to
have a spare worker thread for it.
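And a rough sketch of the second part, assuming the 4.4-era vmstat
shepherd code in mm/vmstat.c:

	static struct workqueue_struct *vmstat_wq;

	/* WQ_MEM_RECLAIM guarantees a rescuer thread, so vmstat updates can
	 * make progress even when all regular workers are stuck in reclaim */
	vmstat_wq = alloc_workqueue("vmstat", WQ_FREEZABLE | WQ_MEM_RECLAIM, 0);

	/* queue the per-cpu vmstat work on the dedicated queue */
	queue_delayed_work_on(cpu, vmstat_wq,
			      &per_cpu(vmstat_work, cpu), 0);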
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Tejun Heo <tj@kernel.org>
Cc: Cristopher Lameter <clameter@sgi.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Arkadiusz Miskiewicz <arekm@maven.pl>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 016c13daa5 ("mm, page_alloc: use masks and shifts when
converting GFP flags to migrate types") has swapped MIGRATE_MOVABLE and
MIGRATE_RECLAIMABLE in the enum definition. However, migratetype_names
wasn't updated to reflect that.
As a result, the file /proc/pagetypeinfo shows the counts for Movable as
Reclaimable and vice versa.
Additionally, commit 0aaa29a56e ("mm, page_alloc: reserve pageblocks
for high-order atomic allocations on demand") introduced
MIGRATE_HIGHATOMIC, but did not add a corresponding letter to
show_migration_types(), so it doesn't appear in the listing of free
areas during page alloc failures or oom kills.
This patch fixes both problems. The atomic reserves will show with a
letter 'H' in the free areas listings.
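For reference, a sketch of the corrected array, assuming the
post-016c13daa5 enum order (the intent described above, not
necessarily the verbatim hunk):

	static char * const migratetype_names[MIGRATE_TYPES] = {
		"Unmovable",
		"Movable",	/* was swapped with "Reclaimable" */
		"Reclaimable",
		"HighAtomic",	/* MIGRATE_HIGHATOMIC, shown as 'H' */
	#ifdef CONFIG_CMA
		"CMA",
	#endif
	#ifdef CONFIG_MEMORY_ISOLATION
		"Isolate",
	#endif
	};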
Fixes: 016c13daa5 ("mm, page_alloc: use masks and shifts when converting GFP flags to migrate types")
Fixes: 0aaa29a56e ("mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the memory.high threshold is exceeded, try_charge() schedules a
task_work to reclaim the excess. The reclaim target is set to the
number of pages requested by try_charge().
This is wrong, because try_charge() usually charges more pages than
requested (batch > nr_pages) in order to refill per-cpu stocks. As a
result, a process in a cgroup can easily exceed memory.high
significantly when doing a lot of charges without returning to
userspace (e.g. reading a file in big chunks).
Fix this issue by ensuring that, when exceeding memory.high, a
process reclaims as many pages as were actually charged (i.e. batch).
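A sketch of the fix in try_charge(), assuming the 4.4-era memcg code
(field names as in mm/memcontrol.c):

	do {
		if (page_counter_read(&memcg->memory) > memcg->high) {
			/* account the full batch actually charged,
			 * not just the nr_pages the caller asked for */
			current->memcg_nr_pages_over_high += batch;
			set_notify_resume(current);
			break;
		}
	} while ((memcg = parent_mem_cgroup(memcg)));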
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When dequeue_huge_page_vma() in alloc_huge_page() fails, we fall back on
alloc_buddy_huge_page() to directly create a hugepage from the buddy
allocator.
In that case, however, if alloc_buddy_huge_page() succeeds we don't
decrement h->resv_huge_pages, which means that a successful
hugetlb_fault() returns without releasing the reserve count. As a
result, subsequent hugetlb_fault() calls might fail even though there
are still free hugepages.
This patch simply adds decrementing code on that code path.
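A sketch of that path in alloc_huge_page(), assuming the v4.3-era
helpers (the exact guard is an assumption):

	page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
	if (!page)
		goto out_uncharge_cgroup;
	/* the fallback consumed a reservation, so release the count here */
	if (!avoid_reserve && vma_has_reserves(vma, gbl_chg)) {
		SetPagePrivate(page);
		h->resv_huge_pages--;
	}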
I reproduced this problem when testing the v4.3 kernel in the
following situation:
- the test machine/VM is a NUMA system,
- hugepage overcommitting is enabled,
- most of the hugepages are allocated and there's only one free
  hugepage which is on node 0 (for example),
- another program, which calls set_mempolicy(MPOL_BIND) to bind itself
  to node 1, tries to allocate a hugepage,
- the allocation should fail, but the reserve count is still held.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org> [3.16+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are no callers of pcibios_init_bus(), so remove it.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Helge Deller <deller@gmx.de>
When using the Promise TX2+ SATA controller on PA-RISC, the system
often crashes with a kernel panic; for example, just writing data with
the dd utility will make it crash.
Kernel panic - not syncing: drivers/parisc/sba_iommu.c: I/O MMU @ 000000000000a000 is out of mapping resources
CPU: 0 PID: 18442 Comm: mkspadfs Not tainted 4.4.0-rc2 #2
Backtrace:
[<000000004021497c>] show_stack+0x14/0x20
[<0000000040410bf0>] dump_stack+0x88/0x100
[<000000004023978c>] panic+0x124/0x360
[<0000000040452c18>] sba_alloc_range+0x698/0x6a0
[<0000000040453150>] sba_map_sg+0x260/0x5b8
[<000000000c18dbb4>] ata_qc_issue+0x264/0x4a8 [libata]
[<000000000c19535c>] ata_scsi_translate+0xe4/0x220 [libata]
[<000000000c19a93c>] ata_scsi_queuecmd+0xbc/0x320 [libata]
[<0000000040499bbc>] scsi_dispatch_cmd+0xfc/0x130
[<000000004049da34>] scsi_request_fn+0x6e4/0x970
[<00000000403e95a8>] __blk_run_queue+0x40/0x60
[<00000000403e9d8c>] blk_run_queue+0x3c/0x68
[<000000004049a534>] scsi_run_queue+0x2a4/0x360
[<000000004049be68>] scsi_end_request+0x1a8/0x238
[<000000004049de84>] scsi_io_completion+0xfc/0x688
[<0000000040493c74>] scsi_finish_command+0x17c/0x1d0
The cause of the crash is not exhaustion of the IOMMU space; there
are plenty of free pages. The function sba_alloc_range is called with
size 0x11000, thus the pages_needed variable is 0x11. The function
sba_search_bitmap is called with bits_wanted 0x11 and a boundary size
of 0x10 (because dma_get_seg_boundary(dev) returns 0xffff).
The function sba_search_bitmap attempts to allocate 17 pages that must
not cross a 16-page boundary - it can't satisfy this requirement
(iommu_is_span_boundary always returns true) and fails even if there
are many free entries in the IOMMU space.
How did it happen that we try to allocate 17 pages that don't cross a
16-page boundary? The cause is in the function iommu_coalesce_chunks.
This function tries to coalesce adjacent entries in the scatterlist.
The function does several checks to decide whether it may coalesce one
entry with the next; one of those checks is this:
	if (startsg->length + dma_len > max_seg_size)
		break;
When it finishes coalescing adjacent entries, it allocates the mapping:
	sg_dma_len(contig_sg) = dma_len;
	dma_len = ALIGN(dma_len + dma_offset, IOVP_SIZE);
	sg_dma_address(contig_sg) =
		PIDE_FLAG
		| (iommu_alloc_range(ioc, dev, dma_len) << IOVP_SHIFT)
		| dma_offset;
It is possible that (startsg->length + dma_len > max_seg_size) is
false (we are just near the 0x10000 max_seg_size boundary), so the
function decides to coalesce this entry with the next entry. When the
coalescing succeeds, the function performs
	dma_len = ALIGN(dma_len + dma_offset, IOVP_SIZE);
And now, because of the non-zero dma_offset, dma_len is greater than
0x10000. iommu_alloc_range (a pointer to sba_alloc_range) is called
and it attempts to allocate 17 pages for a device that must not cross
a 16-page boundary.
To fix the bug, we must make sure that dma_len after addition of
dma_offset and alignment doesn't cross the segment boundary. I.e. change
	if (startsg->length + dma_len > max_seg_size)
		break;
to
	if (ALIGN(dma_len + dma_offset + startsg->length, IOVP_SIZE) > max_seg_size)
		break;
This patch makes this change (it precalculates max_seg_boundary at the
beginning of the function iommu_coalesce_chunks). I also added a check
that the mapping length doesn't exceed dma_get_seg_boundary(dev) (it is
not needed for Promise TX2+ SATA, but it may be needed for other devices
that have dma_get_seg_boundary lower than dma_get_max_seg_size).
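A sketch of the precalculation, assuming the iommu_coalesce_chunks()
context in drivers/parisc/iommu-helpers.h:

	unsigned int max_seg_size = min(dma_get_max_seg_size(dev),
					(unsigned)DMA_CHUNK_SIZE);
	unsigned int max_seg_boundary = dma_get_seg_boundary(dev) + 1;

	/* check if the addition above didn't overflow */
	if (max_seg_boundary)
		max_seg_size = min(max_seg_size, max_seg_boundary);
	...
	if (ALIGN(dma_len + dma_offset + startsg->length, IOVP_SIZE)
	    > max_seg_size)
		break;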
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Helge Deller <deller@gmx.de>
Some X550 devices can connect at 2.5Gbps during fail-over, but only
with certain link partners. Also, setting the advertised speed will
not work, so we do not report 2.5Gbps as supported, to avoid
confusion.
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This patch guarantees that the VFs do not have access to VLANs that
they were not supposed to have. What this patch does is add code so
that we delete the previous port VLAN after adding a new one, and so
that if we reset the VF we clear all of the filters associated with
it.
Previously the code was leaving all previous VLANs mapped to the VF and
they didn't get deleted unless the VF specifically requested it or if the
PF itself was reset.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This patch makes certain that we clear the pool mappings added when we
configure default MAC addresses for the interface. Without this we run the
risk of leaking an address into pool 0 which really belongs to VF 0 when
SR-IOV is enabled.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This patch is a follow-on for enabling VLAN promiscuous mode and
allowing the PF to add VLANs without adding a VLVF entry. What this
patch does is go through and free the VLVF registers if they are not
needed, as the VLAN belongs only to the PF, which is the default
pool.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This patch adds support for VLAN promiscuous with SR-IOV enabled.
The code prior to this patch was only adding the PF to VLANs that the
VF had added. As such, enabling promiscuous mode would not actually
add any additional VLAN filters, so visibility was limited. This led
to a number of issues, as the bridge and OVS would expect us to accept
all VLAN tagged packets when promiscuous mode was enabled, whereas
instead we would filter out most if not all of them, depending on the
configuration of the PF.
With this patch what we do is set all the bits in the VFTA and all of the
VLVF bits associated with the pool belonging to the PF. By doing this the
PF is guaranteed to receive all VLAN tagged traffic associated with the RAR
filters assigned to the PF. In addition we will clean up those same bits
in the event of promiscuous mode being disabled.
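A sketch of the VFTA side of this, assuming ixgbe register helpers
(the VLVF/VLVFB handling for the PF pool is omitted):

	/* set every VFTA bit so all VLAN tags pass the table check */
	for (i = 0; i < hw->mac.vft_size; i++)
		IXGBE_WRITE_REG(hw, IXGBE_VFTA(i), ~0U);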
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This patch is meant to reduce the complexity of the search function
used for finding a VLVF entry associated with a given VLAN ID. The
previous code was searching from bottom to top; I reordered it to
search from top to bottom. In addition, I pulled an AND statement out
of the loop and replaced it with an OR statement outside the loop.
This should help to reduce the overall size and complexity of the
function.
There was also some formatting I cleaned up with regard to whitespace
and such.
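A sketch of the reordered search, assuming the ixgbe_find_vlvf_slot()
helper (bookkeeping simplified):

	/* OR the enable bit into the target once, outside the loop, instead
	 * of ANDing it off every read inside the loop */
	vlan |= IXGBE_VLVF_VIEN;

	/* search from the top down, remembering the first empty slot */
	for (regindex = IXGBE_VLVF_ENTRIES; --regindex;) {
		bits = IXGBE_READ_REG(hw, IXGBE_VLVF(regindex));
		if (bits == vlan)
			return regindex;
		if (!first_empty_slot && !bits)
			first_empty_slot = regindex;
	}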
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This patch adds support for bypassing the VLVF entry creation when the PF
is adding a new VLAN. The advantage to doing this is that we can then save
the VLVF entries for the VFs which must have them in order to function,
versus the PF which can fall back on the default pool entry.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This patch addresses several issues within the VLVF and VLVFB
configuration
First was the fact that the code was overly complicated, with
multiple conditional paths depending on whether we were adding or
removing, and on which bit we were going to add or remove. Instead of
messing with all that I have simplified it by using (vid / 32) and
(1 - vid / 32) to identify our register and the other vlvfb register.
Second was the fact that we were likely leaking a few packets into the
PF in cases where we were deleting an entry and the VFTA filter for
that entry, as the ordering was such that we deleted the pool and then
the VLAN filter instead of the other way around. I have updated that
by adding a check for no bits being set; if that occurs, we clear
things up in the proper order.
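Illustratively (names hypothetical; for an index in 0..63, idx / 32 is
either 0 or 1):

	/* one expression picks "our" VLVFB register, its complement picks
	 * the companion register, with no add/remove-specific branches */
	u32 ours  = IXGBE_VLVFB(vlvf_index * 2 + idx / 32);
	u32 other = IXGBE_VLVFB(vlvf_index * 2 + (1 - idx / 32));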
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
In order to clear the way for upcoming work I thought it best to drop
the level of indent in the ixgbe_set_vfta_generic function. Most of
the code is held in the virtualization-specific section, so the
easiest approach is to just add a jump label and jump past the bulk of
the code if virtualization is not enabled.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This patch simplifies the logic for setting the VFTA register by
reducing the number of conditional checks needed. Instead we just use
some boolean logic to generate vfta_delta, and if that is set then we
XOR the vfta with that value and write it back.
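A sketch of the boolean logic, assuming the ixgbe_set_vfta_generic()
context:

	u32 vfta_delta = 1 << (vlan % 32);
	u32 vfta = IXGBE_READ_REG(hw, IXGBE_VFTA(vlan / 32));

	/* vfta_delta is the XOR mask between the current VFTA word and the
	 * value we want: it is non-zero only if the bit must change */
	vfta_delta &= vlan_on ? ~vfta : vfta;
	vfta ^= vfta_delta;

	if (vfta_delta)
		IXGBE_WRITE_REG(hw, IXGBE_VFTA(vlan / 32), vfta);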
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
The code for checking the PF bit in ixgbe_set_vf_vlan_msg was using
the wrong offset, and as a result it was pulling the VLAN off the PF
even if there were VFs numbered greater than 40 that still had the
VLAN enabled.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Add a check to make certain that mac_table was actually allocated and
is not NULL. If it is NULL, return -ENOMEM and allow the probe routine
to fail rather than causing a NULL pointer dereference further down
the line.
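A sketch of the check in the probe path (allocation flags as in the
existing ixgbe_sw_init() code are assumed):

	adapter->mac_table = kzalloc(sizeof(struct ixgbe_mac_addr) *
				     hw->mac.num_rar_entries, GFP_ATOMIC);
	if (!adapter->mac_table)
		return -ENOMEM;	/* fail probe instead of oopsing later */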
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
The sensor index should be passed instead of 0. For now this makes no
difference, since there is so far only one temperature sensor exposed
by the HW.
Fixes: 89309da39 ("mlxsw: core: Implement temperature hwmon interface")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix copy & paste error in MTPM unpack helper.
Fixes: 85926f8770 ("mlxsw: reg: Add definition of temperature management registers")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Shearman says:
====================
mpls: fixes for nexthops without via addresses
These four fixes all apply to the case of having an mpls route with
an output device, but without a nexthop via address.
Patches 2 and 3 could really have been combined in one patch, but I
wanted to separate the fix for some recent breakage from the fix for a
day-1 issue.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The via address is optional for a single path route, yet is mandatory
when the multipath attribute is used:
# ip -f mpls route add 100 dev lo
# ip -f mpls route add 101 nexthop dev lo
RTNETLINK answers: Invalid argument
Make them consistent by making the via address optional when the
RTA_MULTIPATH attribute is being parsed so that both forms of
specifying the route work.
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a via address isn't specified, the via table is left initialised
to 0 (NEIGH_ARP_TABLE), and the via address length is also left
initialised to 0. This results in a via address array of length 0
being allocated (contiguous with the route and nexthop array), meaning
that when a packet is sent using neigh_xmit, the neighbour lookup and
creation will cause an out-of-bounds access when accessing the 4 bytes
of the IPv4 address it assumes it has been given a pointer to.
This could be fixed by allocating the 4 bytes of via address necessary
and leaving it as all zeroes. However, it seems wrong to me to use an
ipv4 nexthop (including possibly ARPing for 0.0.0.0) when the user
didn't specify to do so.
Instead, set the via address table to NEIGH_NR_TABLES to signify that
it hasn't been specified, and use this at forwarding time to trigger a
neigh_xmit using an L2 address consisting of the device address. This
mechanism is the same as that used by both ARP and ND for loopback
interfaces and those flagged as no-arp, which are all we can really
support in this case.
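A sketch of the forwarding-time handling (sentinel and field names are
assumptions based on the description above):

	/* no via specified: send to the device's own L2 address, as for
	 * loopback and NOARP interfaces */
	if (nh->nh_via_table == NEIGH_NR_TABLES)
		err = neigh_xmit(NEIGH_LINK_TABLE, out_dev,
				 out_dev->dev_addr, skb);
	else
		err = neigh_xmit(nh->nh_via_table, out_dev,
				 mpls_nh_via(rt, nh), skb);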
Fixes: cf4b24f002 ("mpls: reduce memory usage of routes")
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The problem seen is that when adding a route with a nexthop with no
via address specified, iproute2 generates bogus output:
# ip -f mpls route add 100 dev lo
# ip -f mpls route list
100 via inet 0.0.8.0 dev lo
The reason for this is that the kernel generates an RTA_VIA attribute
with the family set to AF_INET, but with via address data of zero
length. The cause of the family being AF_INET is that on route insert
cfg->rc_via_table is left set to 0, which just happens to be
NEIGH_ARP_TABLE, which is then translated into AF_INET.
iproute2 doesn't validate the length prior to printing and so prints
garbage. Although it could be fixed to do the validation, I would
argue that AF_INET addresses should always be exactly 4 bytes so the
kernel is really giving userspace bogus data.
Therefore, avoid generating the RTA_VIA attribute when dumping the
route if the via address wasn't specified on add/modify. This is
indicated by NEIGH_ARP_TABLE and a zero via address length - if the
user specified a via address the address length would have been
validated such that it was 4 bytes. Although this is a change in
behaviour that is visible to userspace, I believe that what was
generated before was invalid and as such userspace wouldn't be
expecting it.
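A sketch of the dump-side guard (helper names as in net/mpls/af_mpls.c;
the exact condition is an assumption):

	/* only emit RTA_VIA if a via address was configured; unset is
	 * NEIGH_ARP_TABLE with a zero address length */
	if ((nh->nh_via_table != NEIGH_ARP_TABLE || nh->nh_via_alen) &&
	    nla_put_via(skb, nh->nh_via_table,
			mpls_nh_via(rt, nh), nh->nh_via_alen))
		goto nla_put_failure;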
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If an L2 via address for an mpls nexthop is specified, the length of
the L2 address must match that expected by the output device,
otherwise it could access memory beyond the end of the via address
buffer in the route.
This check was present prior to commit f8efb73c97 ("mpls: multipath
route support"), but got lost in the refactoring, so add it back,
applying it to all nexthops in multipath routes.
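A sketch of the restored check (field names are assumptions):

	/* an L2 via must match the output device's address length */
	if (via->rtvia_family == AF_PACKET &&
	    dev->addr_len != via_alen)
		goto errout;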
Fixes: f8efb73c97 ("mpls: multipath route support")
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Without this, filter insertion on a VF would fail if only one channel
was in use. This would include the unicast station filter and therefore
no traffic would be received.
Signed-off-by: Bert Kenward <bkenward@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Saeed Mahameed says:
====================
mlx5 improved flow steering management
First two patches fixes some minor issues in recently
introduced SRIOV code.
The other seven patches modifies the driver's code that
manages flow steering rules with Connectx-4 devices.
Basic introduction:
The flow steering device specification model is composed of the following entities:
Destination (either a TIR/Flow table/vport), where a TIR is an RSS
end-point and a vport is the VF eSwitch port in SRIOV.
Flow table entry (FTE) - the values used by the flow specification
Flow table group (FG) - the masks used by the flow specification
Flow table (FT) - groups several FGs and can serve as destination
The flow steering software entities:
In addition to the device objects, the software has two more objects:
Priorities - group several FTs. Handles the order of packet matching.
Namespaces - group several priorities. Namespaces are used in order to
isolate different usages of steering (for example, add two separate
namespaces, one for the NIC driver and one for the E-Switch FDB).
The base data structure for the flow steering management is a tree and
all the flow steering objects such as (Namespace/Flow table/Flow Group/FTE/etc.)
are represented as a node in the tree, e.g.:
Priority-0 -> FT1 -> FG -> FTE -> TIR (destination)
Priority-1 -> FT2 -> FG-> FTE -> TIR (destination)
Matching begins in FT1 flow rules and if there is a miss on all the FTEs
then matching continues on the FTEs in FT2.
The new implementation solves/improves the following
issues in the current code:
1) The new impl. supports multiple destinations, and the search for an
existing rule with the same matching value is performed by the flow
steering management. In the current impl. the E-switch FDB management
code needs to search for existing rules before calling the add rule
function.
2) The new impl. manages the flow table level; in the current
implementation the consumer states the flow table level when a new
flow table is created, without any knowledge about the levels of other
flow tables.
3) In the current impl. the consumer can't create or destroy flow
groups dynamically; the flow groups are passed as an argument to the
create flow table API. The new impl. exposes an API to create/destroy
flow groups.
The series is built as follows:
Patch #1 adds flow steering API firmware commands.
Patch #2 adds the tree operations of the flow steering tree:
add/remove node, initialize node and take a reference count on a node.
Patch #3 adds essential algorithms for managing the flow steering.
Patch #4 initializes the flow steering tree; flow steering
initialization is based on a static tree which illustrates the flow
steering tree when the driver is loaded.
Patch #5 is the main patch of the series. It introduces the flow
steering API.
Patch #6 exposes the new flow steering API and removes the old one.
The Ethernet flow steering follows the existing implementation,
but uses the new steering API.
Patch #7 renames en_flow_table.c to en_fs.c in order to be aligned
with the new flow steering files.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename en_flow_table.c to en_fs.c in order to be aligned
with the new flow steering files.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Expose the new flow steering API and remove the old
one.
A few changes are required:
1. The Ethernet flow steering follows the existing implementation, but
uses the new steering API. The old flow steering implementation is
removed.
2. Move the E-switch FDB management to use the new API.
3. When the driver is loaded, call mlx5_init_fs, which initializes the
flow steering tree structure and opens namespaces for NIC receive and
for the E-switch FDB.
4. Call mlx5_cleanup_fs when the driver is unloaded.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Flow steering initialization is based on a static tree which
illustrates the flow steering tree when the driver is loaded. The
initialization considers the max supported flow table level of the
device; a minimum of 2 kernel flow tables (vlan and mac) are required
to have kernel flow table functionality.
The tree structures when the driver is loaded:
root_namespace(receive nic)
|
priority-0 (kernel priority)
|
namespace(kernel namespace)
|
priority-0 (flow tables priority)
In the following patches, when the EN driver uses the flow steering
API, it creates two flow tables and their flow groups under
priority-0 (flow tables priority).
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introducing the following objects:
mlx5_flow_root_namespace: represents the root of a specific flow table
type tree (e.g. NIC receive, FDB, etc.)
mlx5_flow_group: defines the mask of the flow specification.
fs_fte (flow steering flow table entry): defines the values of the
flow specification.
The following describes the relationships between the tree objects:
root_namespace --> priorities --> namespaces -->
priorities --> flow-tables --> flow-groups -->
flow-entries --> destinations
When we create a new object (flow table/flow group/flow table entry),
we call the FW command and then add the related SW object to the tree.
When we destroy an object, e.g. call mlx5_destroy_flow_table, we use
the tree node destructor to destroy the FW object and remove the node
from the tree.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce the flow steering mlx5_flow_namespace (Namespace)
and fs_prio (Flow Steering Priority) tree nodes.
Namespaces are used in order to isolate different usages or types
of steering (for example, downstream patches will add different
namespaces for the NIC driver and for E-Switch FDB usages).
Flow Steering Priorities are objects that describe priority ranges
between different flow objects under the same namespace. For example,
entries in priority i are matched before entries in priority i+1.
This patch adds the following algorithms:
1) Calculate level: each flow table has a level (the priority between
the flow tables). When we initialize the flow steering tree, we assign
a range of levels to each priority; therefore the level of a new flow
table is its location within the priority, relative to the range of
that priority.
2) Match between match criteria. This function is used for searching
for a flow group when a new flow rule is added.
3) Match between match values. This function is used for searching
for a flow table entry when a new flow rule is added.
4) Add essential macros for traversing a node's children, e.g.
traversing all the flow tables of some priority.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introducing the base data structure and its operations that are going
to represent ConnectX-4 Flow Steering. This data structure is
basically a tree, and all Flow Steering objects such as (Flow
Table/Flow Group/FTE/etc.) are represented as fs_node(s).
fs_node is the base object which describes a basic tree node, with the
following extra info:
type: describes the runtime type of the node (Object).
lock: locks this node's sub-tree.
ref_count: number of children + current references.
remove_func: a generic destructor.
fs_node types will be used and explained once the usage is added in
the following patches.
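For orientation, a sketch of the node as described (the layout follows
the description above; details may differ from the final code):

	enum fs_node_type {
		FS_TYPE_NAMESPACE,
		FS_TYPE_PRIO,
		FS_TYPE_FLOW_TABLE,
		FS_TYPE_FLOW_GROUP,
		FS_TYPE_FLOW_ENTRY,
		FS_TYPE_FLOW_DEST,
	};

	struct fs_node {
		struct list_head	list;
		struct list_head	children;
		enum fs_node_type	type;	/* runtime type of the node */
		struct fs_node		*parent;
		struct mutex		lock;	/* locks this node's sub-tree */
		atomic_t		refcount; /* children + current references */
		void			(*remove_func)(struct fs_node *);
	};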
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce new Flow Steering (FS) firmware commands,
in order to support the new flow steering infrastructure.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Under SRIOV there might be a case where VFs are loaded
without a pre-assigned MAC address. In this case, the VF
will randomize its own MAC. This addresses the case of the
administrator not assigning a MAC to the VF through
the PF OS APIs, and keeps udev happy.
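A sketch of the fallback, assuming the mlx5e netdev init path:

	mlx5_query_nic_vport_mac_address(priv->mdev, 0, netdev->dev_addr);
	if (is_zero_ether_addr(netdev->dev_addr)) {
		/* no MAC assigned by the administrator: randomize one so
		 * the netdev comes up with a valid, udev-friendly address */
		eth_hw_addr_random(netdev);
	}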
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
E-Switch capabilities should be queried only if the E-Switch flow
table is supported, not merely when the device is a vport group
manager.
Fixes: d6666753c6 ("net/mlx5: E-Switch, Introduce HCA cap and E-Switch vport context")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The file ila.h, used for lightweight tunnels, is used by iproute2 but
is not exported yet.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham says:
====================
net: thunderx: Support for pass-2 hw features
This patch set adds support for new features added in the pass-2
revision of the hardware, like TSO and count-based interrupt
coalescing.
Changes from v1:
- Addressed comments received regarding the boolean bit field changes
  by excluding them from this patch. Will submit a separate
  patch along with a cleanup of the unused field.
- Got rid of the new macro 'VNIC_NAPI_WEIGHT' introduced in
  the count threshold interrupt patch.
====================
Reviewed-by: Pavel Fedin <p.fedin@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This feature is introduced in the pass-2 chip; with it, CQ interrupt
coalescing will work based on both timer and count.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds support for offloading TCP segmentation to HW in the pass-2
revision of the hardware. Both driver-level SW TSO for pass-1.x chips
and HW TSO for pass-2 chips will co-exist. The SQ descriptor
structures are modified to reflect the pass-2 hw implementation.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan says:
====================
bnxt_en: Bug fix and add tx timeout recovery.
Fix a bitmap declaration bug and add missing tx timeout recovery.
v2: Fixed white space error. Thanks Dave.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The reset logic calls bnxt_close_nic() and bnxt_open_nic() under rtnl_lock
from bnxt_sp_task. BNXT_STATE_IN_SP_TASK must be cleared before calling
bnxt_close_nic() to avoid deadlock.
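A sketch of the ordering in bnxt_sp_task() (event and helper names
follow the driver):

	if (test_and_clear_bit(BNXT_RESET_TASK_SP_EVENT, &bp->sp_event)) {
		/* bnxt_reset_task() calls bnxt_close_nic(), which waits for
		 * BNXT_STATE_IN_SP_TASK to clear - so clear it first */
		clear_bit(BNXT_STATE_IN_SP_TASK, &bp->state);
		rtnl_lock();
		bnxt_reset_task(bp);
		set_bit(BNXT_STATE_IN_SP_TASK, &bp->state);
		rtnl_unlock();
	}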
v2: Fixed white space error. Thanks Dave.
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When implementing driver reset from tx_timeout in the next patch,
bnxt_close_nic() will be called from the sp_task workqueue. Calling
cancel_work() on sp_task will hang the workqueue.
Instead, set a new bit BNXT_STATE_IN_SP_TASK when bnxt_sp_task() is running.
bnxt_close_nic() will wait for BNXT_STATE_IN_SP_TASK to clear before
proceeding.
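A sketch of the handshake (memory barriers as the atomic bitops
require):

	/* in bnxt_sp_task() */
	set_bit(BNXT_STATE_IN_SP_TASK, &bp->state);
	smp_mb__after_atomic();
	/* ... do slow-path work ... */
	smp_mb__before_atomic();
	clear_bit(BNXT_STATE_IN_SP_TASK, &bp->state);

	/* in bnxt_close_nic(): wait for a running bnxt_sp_task() to finish */
	while (test_bit(BNXT_STATE_IN_SP_TASK, &bp->state))
		msleep(20);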
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows multiple independent bits to be set for various states.
Subsequent patches to implement tx timeout reset will require this.
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The declaration of the bitmap vf_req_snif_bmap using a fixed array of
unsigned long will only work on 64-bit archs. Use DECLARE_BITMAP
instead, which will work on all archs.
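A sketch of the change (the 256-bit size matches the four 64-bit words
the old declaration provided on 64-bit archs):

	/* before: unsigned long vf_req_snif_bmap[4]; - 128 bits on 32-bit */
	DECLARE_BITMAP(vf_req_snif_bmap, 256);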
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>