mirror of
https://github.com/AuxXxilium/linux_dsm_epyc7002.git
synced 2024-12-27 15:35:06 +07:00
68ad4a3304
Patch series "improve vmap allocation", v3. Objective --------- Please have a look for the description at: https://lkml.org/lkml/2018/10/19/786 but let me also summarize it a bit here as well. The current implementation has O(N) complexity. Requests with different permissive parameters can lead to long allocation time. When i say "long" i mean milliseconds. Description ----------- This approach organizes the KVA memory layout into free areas of the 1-ULONG_MAX range, i.e. an allocation is done over free areas lookups, instead of finding a hole between two busy blocks. It allows to have lower number of objects which represent the free space, therefore to have less fragmented memory allocator. Because free blocks are always as large as possible. It uses the augment tree where all free areas are sorted in ascending order of va->va_start address in pair with linked list that provides O(1) access to prev/next elements. Since the tree is augment, we also maintain the "subtree_max_size" of VA that reflects a maximum available free block in its left or right sub-tree. Knowing that, we can easily traversal toward the lowest (left most path) free area. Allocation: ~O(log(N)) complexity. It is sequential allocation method therefore tends to maximize locality. The search is done until a first suitable block is large enough to encompass the requested parameters. Bigger areas are split. I copy paste here the description of how the area is split, since i described it in https://lkml.org/lkml/2018/10/19/786 <snip> A free block can be split by three different ways. Their names are FL_FIT_TYPE, LE_FIT_TYPE/RE_FIT_TYPE and NE_FIT_TYPE, i.e. they correspond to how requested size and alignment fit to a free block. FL_FIT_TYPE - in this case a free block is just removed from the free list/tree because it fully fits. Comparing with current design there is an extra work with rb-tree updating. LE_FIT_TYPE/RE_FIT_TYPE - left/right edges fit. In this case what we do is just cutting a free block. It is as fast as a current design. Most of the vmalloc allocations just end up with this case, because the edge is always aligned to 1. NE_FIT_TYPE - Is much less common case. Basically it happens when requested size and alignment does not fit left nor right edges, i.e. it is between them. In this case during splitting we have to build a remaining left free area and place it back to the free list/tree. Comparing with current design there are two extra steps. First one is we have to allocate a new vmap_area structure. Second one we have to insert that remaining free block to the address sorted list/tree. In order to optimize a first case there is a cache with free_vmap objects. Instead of allocating from slab we just take an object from the cache and reuse it. Second one is pretty optimized. Since we know a start point in the tree we do not do a search from the top. Instead a traversal begins from a rb-tree node we split. <snip> De-allocation. ~O(log(N)) complexity. An area is not inserted straight away to the tree/list, instead we identify the spot first, checking if it can be merged around neighbors. The list provides O(1) access to prev/next, so it is pretty fast to check it. Summarizing. If merged then large coalesced areas are created, if not the area is just linked making more fragments. There is one more thing that i should mention here. After modification of VA node, its subtree_max_size is updated if it was/is the biggest area in its left or right sub-tree. Apart of that it can also be populated back to upper levels to fix the tree. For more details please have a look at the __augment_tree_propagate_from() function and the description. Tests and stressing ------------------- I use the "test_vmalloc.sh" test driver available under "tools/testing/selftests/vm/" since 5.1-rc1 kernel. Just trigger "sudo ./test_vmalloc.sh" to find out how to deal with it. Tested on different platforms including x86_64/i686/ARM64/x86_64_NUMA. Regarding last one, i do not have any physical access to NUMA system, therefore i emulated it. The time of stressing is days. If you run the test driver in "stress mode", you also need the patch that is in Andrew's tree but not in Linux 5.1-rc1. So, please apply it: http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/commit/?id=e0cf7749bade6da318e98e934a24d8b62fab512c After massive testing, i have not identified any problems like memory leaks, crashes or kernel panics. I find it stable, but more testing would be good. Performance analysis -------------------- I have used two systems to test. One is i5-3320M CPU @ 2.60GHz and another is HiKey960(arm64) board. i5-3320M runs on 4.20 kernel, whereas Hikey960 uses 4.15 kernel. I have both system which could run on 5.1-rc1 as well, but the results have not been ready by time i an writing this. Currently it consist of 8 tests. There are three of them which correspond to different types of splitting(to compare with default). We have 3 ones(see above). Another 5 do allocations in different conditions. a) sudo ./test_vmalloc.sh performance When the test driver is run in "performance" mode, it runs all available tests pinned to first online CPU with sequential execution test order. We do it in order to get stable and repeatable results. Take a look at time difference in "long_busy_list_alloc_test". It is not surprising because the worst case is O(N). # i5-3320M How many cycles all tests took: CPU0=646919905370(default) cycles vs CPU0=193290498550(patched) cycles # See detailed table with results here: ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_default.txt ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_patched.txt # Hikey960 8x CPUs How many cycles all tests took: CPU0=3478683207 cycles vs CPU0=463767978 cycles # See detailed table with results here: ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_default.txt ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_patched.txt b) time sudo ./test_vmalloc.sh test_repeat_count=1 With this configuration, all tests are run on all available online CPUs. Before running each CPU shuffles its tests execution order. It gives random allocation behaviour. So it is rough comparison, but it puts in the picture for sure. # i5-3320M <default> vs <patched> real 101m22.813s real 0m56.805s user 0m0.011s user 0m0.015s sys 0m5.076s sys 0m0.023s # See detailed table with results here: ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_default.txt ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_patched.txt # Hikey960 8x CPUs <default> vs <patched> real unknown real 4m25.214s user unknown user 0m0.011s sys unknown sys 0m0.670s I did not manage to complete this test on "default Hikey960" kernel version. After 24 hours it was still running, therefore i had to cancel it. That is why real/user/sys are "unknown". This patch (of 3): Currently an allocation of the new vmap area is done over busy list iteration(complexity O(n)) until a suitable hole is found between two busy areas. Therefore each new allocation causes the list being grown. Due to over fragmented list and different permissive parameters an allocation can take a long time. For example on embedded devices it is milliseconds. This patch organizes the KVA memory layout into free areas of the 1-ULONG_MAX range. It uses an augment red-black tree that keeps blocks sorted by their offsets in pair with linked list keeping the free space in order of increasing addresses. Nodes are augmented with the size of the maximum available free block in its left or right sub-tree. Thus, that allows to take a decision and traversal toward the block that will fit and will have the lowest start address, i.e. it is sequential allocation. Allocation: to allocate a new block a search is done over the tree until a suitable lowest(left most) block is large enough to encompass: the requested size, alignment and vstart point. If the block is bigger than requested size - it is split. De-allocation: when a busy vmap area is freed it can either be merged or inserted to the tree. Red-black tree allows efficiently find a spot whereas a linked list provides a constant-time access to previous and next blocks to check if merging can be done. In case of merging of de-allocated memory chunk a large coalesced area is created. Complexity: ~O(log(N)) [urezki@gmail.com: v3] Link: http://lkml.kernel.org/r/20190402162531.10888-2-urezki@gmail.com [urezki@gmail.com: v4] Link: http://lkml.kernel.org/r/20190406183508.25273-2-urezki@gmail.com Link: http://lkml.kernel.org/r/20190321190327.11813-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Thomas Garnier <thgarnie@google.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Joel Fernandes <joelaf@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
229 lines
6.7 KiB
C
229 lines
6.7 KiB
C
/* SPDX-License-Identifier: GPL-2.0 */
|
|
#ifndef _LINUX_VMALLOC_H
|
|
#define _LINUX_VMALLOC_H
|
|
|
|
#include <linux/spinlock.h>
|
|
#include <linux/init.h>
|
|
#include <linux/list.h>
|
|
#include <linux/llist.h>
|
|
#include <asm/page.h> /* pgprot_t */
|
|
#include <linux/rbtree.h>
|
|
#include <linux/overflow.h>
|
|
|
|
struct vm_area_struct; /* vma defining user mapping in mm_types.h */
|
|
struct notifier_block; /* in notifier.h */
|
|
|
|
/* bits in flags of vmalloc's vm_struct below */
|
|
#define VM_IOREMAP 0x00000001 /* ioremap() and friends */
|
|
#define VM_ALLOC 0x00000002 /* vmalloc() */
|
|
#define VM_MAP 0x00000004 /* vmap()ed pages */
|
|
#define VM_USERMAP 0x00000008 /* suitable for remap_vmalloc_range */
|
|
#define VM_UNINITIALIZED 0x00000020 /* vm_struct is not fully initialized */
|
|
#define VM_NO_GUARD 0x00000040 /* don't add guard page */
|
|
#define VM_KASAN 0x00000080 /* has allocated kasan shadow memory */
|
|
/*
|
|
* Memory with VM_FLUSH_RESET_PERMS cannot be freed in an interrupt or with
|
|
* vfree_atomic().
|
|
*/
|
|
#define VM_FLUSH_RESET_PERMS 0x00000100 /* Reset direct map and flush TLB on unmap */
|
|
/* bits [20..32] reserved for arch specific ioremap internals */
|
|
|
|
/*
|
|
* Maximum alignment for ioremap() regions.
|
|
* Can be overriden by arch-specific value.
|
|
*/
|
|
#ifndef IOREMAP_MAX_ORDER
|
|
#define IOREMAP_MAX_ORDER (7 + PAGE_SHIFT) /* 128 pages */
|
|
#endif
|
|
|
|
struct vm_struct {
|
|
struct vm_struct *next;
|
|
void *addr;
|
|
unsigned long size;
|
|
unsigned long flags;
|
|
struct page **pages;
|
|
unsigned int nr_pages;
|
|
phys_addr_t phys_addr;
|
|
const void *caller;
|
|
};
|
|
|
|
struct vmap_area {
|
|
unsigned long va_start;
|
|
unsigned long va_end;
|
|
|
|
/*
|
|
* Largest available free size in subtree.
|
|
*/
|
|
unsigned long subtree_max_size;
|
|
unsigned long flags;
|
|
struct rb_node rb_node; /* address sorted rbtree */
|
|
struct list_head list; /* address sorted list */
|
|
struct llist_node purge_list; /* "lazy purge" list */
|
|
struct vm_struct *vm;
|
|
};
|
|
|
|
/*
|
|
* Highlevel APIs for driver use
|
|
*/
|
|
extern void vm_unmap_ram(const void *mem, unsigned int count);
|
|
extern void *vm_map_ram(struct page **pages, unsigned int count,
|
|
int node, pgprot_t prot);
|
|
extern void vm_unmap_aliases(void);
|
|
|
|
#ifdef CONFIG_MMU
|
|
extern void __init vmalloc_init(void);
|
|
#else
|
|
static inline void vmalloc_init(void)
|
|
{
|
|
}
|
|
#endif
|
|
|
|
extern void *vmalloc(unsigned long size);
|
|
extern void *vzalloc(unsigned long size);
|
|
extern void *vmalloc_user(unsigned long size);
|
|
extern void *vmalloc_node(unsigned long size, int node);
|
|
extern void *vzalloc_node(unsigned long size, int node);
|
|
extern void *vmalloc_exec(unsigned long size);
|
|
extern void *vmalloc_32(unsigned long size);
|
|
extern void *vmalloc_32_user(unsigned long size);
|
|
extern void *__vmalloc(unsigned long size, gfp_t gfp_mask, pgprot_t prot);
|
|
extern void *__vmalloc_node_range(unsigned long size, unsigned long align,
|
|
unsigned long start, unsigned long end, gfp_t gfp_mask,
|
|
pgprot_t prot, unsigned long vm_flags, int node,
|
|
const void *caller);
|
|
#ifndef CONFIG_MMU
|
|
extern void *__vmalloc_node_flags(unsigned long size, int node, gfp_t flags);
|
|
static inline void *__vmalloc_node_flags_caller(unsigned long size, int node,
|
|
gfp_t flags, void *caller)
|
|
{
|
|
return __vmalloc_node_flags(size, node, flags);
|
|
}
|
|
#else
|
|
extern void *__vmalloc_node_flags_caller(unsigned long size,
|
|
int node, gfp_t flags, void *caller);
|
|
#endif
|
|
|
|
extern void vfree(const void *addr);
|
|
extern void vfree_atomic(const void *addr);
|
|
|
|
extern void *vmap(struct page **pages, unsigned int count,
|
|
unsigned long flags, pgprot_t prot);
|
|
extern void vunmap(const void *addr);
|
|
|
|
extern int remap_vmalloc_range_partial(struct vm_area_struct *vma,
|
|
unsigned long uaddr, void *kaddr,
|
|
unsigned long size);
|
|
|
|
extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
|
|
unsigned long pgoff);
|
|
void vmalloc_sync_all(void);
|
|
|
|
/*
|
|
* Lowlevel-APIs (not for driver use!)
|
|
*/
|
|
|
|
static inline size_t get_vm_area_size(const struct vm_struct *area)
|
|
{
|
|
if (!(area->flags & VM_NO_GUARD))
|
|
/* return actual size without guard page */
|
|
return area->size - PAGE_SIZE;
|
|
else
|
|
return area->size;
|
|
|
|
}
|
|
|
|
extern struct vm_struct *get_vm_area(unsigned long size, unsigned long flags);
|
|
extern struct vm_struct *get_vm_area_caller(unsigned long size,
|
|
unsigned long flags, const void *caller);
|
|
extern struct vm_struct *__get_vm_area(unsigned long size, unsigned long flags,
|
|
unsigned long start, unsigned long end);
|
|
extern struct vm_struct *__get_vm_area_caller(unsigned long size,
|
|
unsigned long flags,
|
|
unsigned long start, unsigned long end,
|
|
const void *caller);
|
|
extern struct vm_struct *remove_vm_area(const void *addr);
|
|
extern struct vm_struct *find_vm_area(const void *addr);
|
|
|
|
extern int map_vm_area(struct vm_struct *area, pgprot_t prot,
|
|
struct page **pages);
|
|
#ifdef CONFIG_MMU
|
|
extern int map_kernel_range_noflush(unsigned long start, unsigned long size,
|
|
pgprot_t prot, struct page **pages);
|
|
extern void unmap_kernel_range_noflush(unsigned long addr, unsigned long size);
|
|
extern void unmap_kernel_range(unsigned long addr, unsigned long size);
|
|
static inline void set_vm_flush_reset_perms(void *addr)
|
|
{
|
|
struct vm_struct *vm = find_vm_area(addr);
|
|
|
|
if (vm)
|
|
vm->flags |= VM_FLUSH_RESET_PERMS;
|
|
}
|
|
#else
|
|
static inline int
|
|
map_kernel_range_noflush(unsigned long start, unsigned long size,
|
|
pgprot_t prot, struct page **pages)
|
|
{
|
|
return size >> PAGE_SHIFT;
|
|
}
|
|
static inline void
|
|
unmap_kernel_range_noflush(unsigned long addr, unsigned long size)
|
|
{
|
|
}
|
|
static inline void
|
|
unmap_kernel_range(unsigned long addr, unsigned long size)
|
|
{
|
|
}
|
|
static inline void set_vm_flush_reset_perms(void *addr)
|
|
{
|
|
}
|
|
#endif
|
|
|
|
/* Allocate/destroy a 'vmalloc' VM area. */
|
|
extern struct vm_struct *alloc_vm_area(size_t size, pte_t **ptes);
|
|
extern void free_vm_area(struct vm_struct *area);
|
|
|
|
/* for /dev/kmem */
|
|
extern long vread(char *buf, char *addr, unsigned long count);
|
|
extern long vwrite(char *buf, char *addr, unsigned long count);
|
|
|
|
/*
|
|
* Internals. Dont't use..
|
|
*/
|
|
extern struct list_head vmap_area_list;
|
|
extern __init void vm_area_add_early(struct vm_struct *vm);
|
|
extern __init void vm_area_register_early(struct vm_struct *vm, size_t align);
|
|
|
|
#ifdef CONFIG_SMP
|
|
# ifdef CONFIG_MMU
|
|
struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
|
|
const size_t *sizes, int nr_vms,
|
|
size_t align);
|
|
|
|
void pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms);
|
|
# else
|
|
static inline struct vm_struct **
|
|
pcpu_get_vm_areas(const unsigned long *offsets,
|
|
const size_t *sizes, int nr_vms,
|
|
size_t align)
|
|
{
|
|
return NULL;
|
|
}
|
|
|
|
static inline void
|
|
pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms)
|
|
{
|
|
}
|
|
# endif
|
|
#endif
|
|
|
|
#ifdef CONFIG_MMU
|
|
#define VMALLOC_TOTAL (VMALLOC_END - VMALLOC_START)
|
|
#else
|
|
#define VMALLOC_TOTAL 0UL
|
|
#endif
|
|
|
|
int register_vmap_purge_notifier(struct notifier_block *nb);
|
|
int unregister_vmap_purge_notifier(struct notifier_block *nb);
|
|
|
|
#endif /* _LINUX_VMALLOC_H */
|