Added: Dec 05, 2022
Slides: 54 pages
Slide Content
* Based on kernel 5.11 (x86_64), QEMU
* 2-socket CPUs (4 cores/socket)
* 16GB memory
* Kernel parameters: nokaslr norandmaps
* KASAN: disabled
* Userspace: ASLR is disabled
* Legacy BIOS
malloc & vmalloc in Linux
Adrian Huang | Dec, 2022
Agenda
• Memory Allocation in Linux
• malloc -> brk() implementation in Linux Kernel
  o Will *NOT* focus on the glibc malloc implementation; see the link: malloc internals
• vmalloc: Non-contiguous memory allocation
• [Note] kmalloc has been discussed here: Slide #88 of Slab Allocator in Linux Kernel
Memory Allocation in Linux
[Figure: allocation layers, top to bottom]
User Space: glibc malloc/free, built on brk/mmap
Kernel Space: kmalloc/kfree (Slab Allocator), vmalloc, . . .
Kernel Space: alloc_page(s), __get_free_page(s) (Buddy System)
Hardware
[glibc] malloc
• Balance between brk() and mmap()
• Use brk() if request size < DEFAULT_MMAP_THRESHOLD_MIN (128 KB)
  o The heap can be trimmed only if memory is freed at the top end.
  o sbrk() is implemented as a library function that uses the brk() system call.
  o When the heap is used up, allocate a memory chunk > 128KB via brk().
    ▪ Saves the overhead of frequent brk() system calls
• Use mmap() if request size >= DEFAULT_MMAP_THRESHOLD_MIN (128 KB)
  o The allocated memory blocks can be independently released back to the system.
  o Deallocated space is not placed on the free list for reuse by later allocations.
  o Memory may be wasted because mmap allocations must be page-aligned, and the kernel must perform the expensive task of zeroing out the allocated memory.
  o Note: glibc uses a dynamic mmap threshold
  o Detail: `man mallopt`
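The brk()/mmap() split above can be sketched as a single comparison. This is a minimal model, not glibc code: `choose_backend`, `enum backend`, and `MMAP_THRESHOLD_DEFAULT` are hypothetical names, and the sketch ignores the dynamic threshold adjustment and any other conditions glibc checks.

```c
#include <stddef.h>

/* Default threshold quoted on the slide: 128 KB. */
#define MMAP_THRESHOLD_DEFAULT (128u * 1024u)

enum backend { BACKEND_BRK, BACKEND_MMAP };

/*
 * Hypothetical helper: pick the system call used to obtain new memory
 * for a request that cannot be served from already-allocated heap space.
 * Requests at or above the threshold go to mmap(), smaller ones to brk().
 */
enum backend choose_backend(size_t request, size_t threshold)
{
    return request >= threshold ? BACKEND_MMAP : BACKEND_BRK;
}
```

For example, `choose_backend(4096, MMAP_THRESHOLD_DEFAULT)` selects brk(), while a 128 KB request selects mmap(). Passing the threshold as a parameter leaves room for the dynamic threshold mentioned in the note.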
kmalloc & vmalloc
• kmalloc: Physically contiguous memory allocation
• vmalloc: Non-contiguous memory allocation
  o Scenario: memory allocation size > PAGE_SIZE (4KB)
  o Allocates virtually contiguous memory
    ▪ Physical memory might NOT be contiguous
malloc() -> brk() implementation in Linux Kernel
• Quick view: Process Address Space - Heap
• sys_brk - Call path
• [From scratch] Launch a program: load_elf_binary() in Linux kernel
  o VMA change observation
  o Heap (brk or program break) configuration
• [Program Launch] strace observation: heap - brk()
• strace observation: allocate space via malloc()
  o If the heap space is used up, how much does malloc() request via brk()?
• glibc: malloc implementation for memory request size
Quick view: Process Address Space - Heap
[Figure: process virtual address space, from low to high addresses]
0
Text: mm->start_code = 0x40_0000
Data: mm->start_data to mm->end_data
BSS
HEAP: mm->start_brk to mm->brk
mmap: mm->mmap_base = 0x7FFF_F7FF_F000
128MB gap: from mm->mmap_base up to STACK_TOP_MAX
Stack Guard Gap
Stack (Default size: 8MB): mm->start_stack
STACK_TOP_MAX = 0x7FFF_FFFF_F000
0x7FFF_FFFF_FFFF
Why are they different?
sys_brk - Call path
sys_brk
  if brk < mm->start_brk: return mm->brk
  newbrk = PAGE_ALIGN(brk)
  oldbrk = PAGE_ALIGN(mm->brk)
  shrink brk if brk <= mm->brk: __do_munmap
  otherwise: do_brk_flags
    can expand the existing anonymous mapping: vma_merge
    cannot expand the existing anonymous mapping: vm_area_alloc
  mm->brk = brk
  if (mm->def_flags & VM_LOCKED) != 0: mm_populate
    __mm_populate
      populate_vma_page_range
        __get_user_pages
          follow_page_mask: find if the page is populated
          the page is NOT populated yet: faultin_page
            handle_mm_fault
  return brk (the new program break)
[By default] Heap (brk) space is demand-paged: pages are populated on first access
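The decisions in the sys_brk call path can be modelled in user space. The sketch below is not kernel code: `struct mm_model`, `enum brk_action`, and `model_brk` are hypothetical names, only the two fields needed for the decision are kept, and the "same page" shortcut is included on the assumption that the kernel skips VMA work when the page-aligned break does not change.

```c
#define MODEL_PAGE_SIZE 4096UL
#define MODEL_PAGE_ALIGN(x) \
    (((x) + MODEL_PAGE_SIZE - 1) & ~(MODEL_PAGE_SIZE - 1))

enum brk_action { BRK_REJECT, BRK_SAME_PAGE, BRK_SHRINK, BRK_EXPAND };

/* Hypothetical stand-in for the two mm_struct fields used here. */
struct mm_model {
    unsigned long start_brk;
    unsigned long brk;
};

/* Classify a brk() request and update the modelled program break. */
enum brk_action model_brk(struct mm_model *mm, unsigned long brk)
{
    unsigned long newbrk, oldbrk;

    if (brk < mm->start_brk)
        return BRK_REJECT;          /* caller gets the old mm->brk back */

    newbrk = MODEL_PAGE_ALIGN(brk);
    oldbrk = MODEL_PAGE_ALIGN(mm->brk);

    if (newbrk == oldbrk) {         /* no page boundary crossed */
        mm->brk = brk;
        return BRK_SAME_PAGE;
    }
    if (brk <= mm->brk) {           /* kernel path: __do_munmap */
        mm->brk = brk;
        return BRK_SHRINK;
    }
    mm->brk = brk;                  /* kernel path: do_brk_flags */
    return BRK_EXPAND;
}
```

Starting from `start_brk = brk = 0x4c5000`, a request for 0x4e6000 is classified as an expansion, and a later request for 0x4d0000 as a shrink.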
[From scratch] Launch a program: load_elf_binary() in Linux kernel
# ./free_and_sbrk1 1
[Figure: VMAs created by load_elf_binary() in the kernel, in address order]
vma (R): vm_start = 0x400000, vm_end = 0x401000
vma (R, E): vm_start = 0x401000, vm_end = 0x496000
vma (R): vm_start = 0x496000, vm_end = 0x4be000
vma (R, W): vm_start = 0x4be000, vm_end = 0x4c4000
GAP
vma (vvar): vm_start = 0x7ffff7ffa000, vm_end = 0x7ffff7ffe000
vma (vdso): vm_start = 0x7ffff7ffe000, vm_end = 0x7ffff7fff000
GAP
vma (stack): vm_start = 0x7fffff85d000, vm_end = 0x7ffffffff000
After launching a program: Question
# ./free_and_sbrk1 1
[Figure: the same VMA layout as on the previous slide, annotated with the question "Why?"]
[From scratch] Launch a program: load_elf_binary() - Heap Configuration
# ./free_and_sbrk1 1
[Figure: the same VMA layout as above]
load_elf_binary
  set_brk
    vm_brk_flags
      do_brk_flags
        can expand the existing anonymous mapping: vma_merge
        cannot expand the existing anonymous mapping: vm_area_alloc
    mm->{start_brk, brk} = end
[From scratch] Launch a program: load_elf_binary() - Heap Configuration (result)
[Figure: the same VMA layout and set_brk call path as above, with the new heap VMA added]
vma (heap): vm_start = 0x4c4000, vm_end = 0x4c5000
mm->{start_brk, brk} = end
mm->brk = mm->start_brk = 0x4c5000
Why? Labels shown on the figure: elf_bss, elf_brk
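The heap configuration step can be sketched as follows, assuming set_brk() page-aligns both of its inputs, maps the range between them, and sets both start_brk and brk to the aligned end. `struct heap_setup` and `model_set_brk` are hypothetical names, and the unaligned inputs in the usage note are made-up values chosen only to reproduce the heap VMA shown above; the real elf_bss and elf_brk values are not given on the slide.

```c
#define MODEL_ELF_PAGE 4096UL
#define MODEL_ELF_PAGEALIGN(x) \
    (((x) + MODEL_ELF_PAGE - 1) & ~(MODEL_ELF_PAGE - 1))

/* Hypothetical result record: the heap VMA and the program break. */
struct heap_setup {
    int has_vma;                 /* 1 if a heap VMA would be created */
    unsigned long vm_start;
    unsigned long vm_end;
    unsigned long start_brk;
    unsigned long brk;
};

/* Model of the heap setup: inputs assumed to come from elf_bss/elf_brk. */
struct heap_setup model_set_brk(unsigned long start, unsigned long end)
{
    struct heap_setup h = { 0, 0, 0, 0, 0 };

    start = MODEL_ELF_PAGEALIGN(start);
    end = MODEL_ELF_PAGEALIGN(end);
    if (end > start) {           /* kernel path: vm_brk_flags */
        h.has_vma = 1;
        h.vm_start = start;
        h.vm_end = end;
    }
    h.start_brk = h.brk = end;   /* mm->{start_brk, brk} = end */
    return h;
}
```

With the made-up inputs `model_set_brk(0x4c3f00, 0x4c4a00)`, the model yields a heap VMA of 0x4c4000 to 0x4c5000 and a program break of 0x4c5000, matching the values on the slide.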
strace observation: allocate space via malloc() #1
[Init stage] 0x4e8000 - 0x4c7000 = 0x21000 (132KB: 33 pages)
strace observation: allocate space via malloc() #2
[Init stage] 0x21000 (132KB: 33 pages)
Current program break is used up: allocate another 132KB
malloc.c
Heap space allocation from malloc(): Allocate memory chunk > 128KB via brk()
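The 132KB step seen in strace is consistent with a simple size computation: the request plus a top padding plus a minimum chunk size, rounded up to a page. The sketch below is a simplified model, assuming a top pad of 128KB, a minimum chunk size of 32 bytes on x86_64, and 4KB pages. `model_brk_increment` and the `MODEL_*` constants are hypothetical names, not glibc identifiers.

```c
#include <stddef.h>

#define MODEL_TOP_PAD  (128u * 1024u)  /* assumed default top padding */
#define MODEL_MINSIZE  32u             /* assumed minimum chunk size, x86_64 */
#define MODEL_PAGESIZE 4096u           /* assumed page size */

/*
 * nb:           padded request size in bytes
 * old_top_size: bytes still free at the top of the heap (0 on first use)
 * Returns the number of bytes by which the program break would grow.
 */
size_t model_brk_increment(size_t nb, size_t old_top_size)
{
    size_t size = nb + MODEL_TOP_PAD + MODEL_MINSIZE;

    size -= old_top_size;  /* contiguous heap: reuse what is left at the top */
    /* Round up to a whole number of pages. */
    return (size + MODEL_PAGESIZE - 1) & ~((size_t)MODEL_PAGESIZE - 1);
}
```

For a small first request such as `model_brk_increment(1008, 0)`, the result is 0x21000 bytes, that is, 33 pages or 132KB, which matches the heap growth observed on the slides.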
Memory Allocation in Linux - brk() detail
[Figure: the allocation-layer diagram from "Memory Allocation in Linux", with the malloc path expanded]
User application: malloc
glibc: malloc implementation ([glibc] malloc: check sysmalloc() for implementation)
  Allocated heap space enough?
    Y: Return an available address from the allocated heap space
    N: If size < 128KB, allocate a “memory chunk > 128KB” by calling brk()
brk or mmap: VMA configuration & program break adjustment
Page fault handler
Kernel Space: kmalloc/kfree (Slab Allocator), vmalloc, . . .
Kernel Space: alloc_page(s), __get_free_page(s) (Buddy System)
Hardware
glibc: malloc implementation for memory request size
* MORECORE() -> __sbrk() -> __brk()
glibc: malloc implementation for memory request size
Heap space allocation from malloc(): Allocate memory chunk > 128KB via brk()
[Screenshot: malloc.c, with numbered callouts 1 to 6]
Heap is expanded by 0x21000 (33 pages): 0x555555559000 -> 0x55555557a000
glibc: malloc implementation for memory request size
Detail Reference
• [glibc] malloc internals
  o Concepts: chunks, arenas, heaps, and thread local cache (tcache)
64-bit Virtual Address in x86_64
Reference: Documentation/x86/x86_64/mm.rst
[Figure: 64-bit virtual address space]
0 to 0x0000_7FFF_FFFF_FFFF: User Space (128TB)
Empty Space
0xFFFF_8000_0000_0000 to 0xFFFF_FFFF_FFFF_FFFF: Kernel Space (128TB)
[Figure: kernel virtual address space, default configuration, from low to high addresses]
0xFFFF_8000_0000_0000: Guard hole (8TB)
LDT remap for PTI (0.5TB)
page_offset_base = 0xFFFF_8880_0000_0000: Page frame direct mapping (64TB)
Unused hole (0.5TB)
vmalloc_base = 0xFFFF_C900_0000_0000: vmalloc/ioremap (32TB), VMALLOC_START = 0xFFFF_C900_0000_0000, VMALLOC_END = 0xFFFF_E8FF_FFFF_FFFF
Unused hole (1TB)
vmemmap_base = 0xFFFF_EA00_0000_0000: Virtual memory map, 1TB (stores page frame descriptors)
…
__START_KERNEL_map = 0xFFFF_FFFF_8000_0000: Kernel text mapping from physical address 0 (1GB or 512MB); kernel code [.text, .data…] at __START_KERNEL = 0xFFFF_FFFF_8100_0000
MODULES_VADDR: Modules (1GB or 1.5GB)
FIXADDR_START to FIXADDR_TOP = 0xFFFF_FFFF_FF7F_F000: Fix-mapped address space (expanded to 4MB: 05ab1d8a4b36)
Unused hole (2MB)
* page_offset_base, vmalloc_base and vmemmap_base can be dynamically configured by KASLR (Kernel Address Space Layout Randomization - "arch/x86/mm/kaslr.c")
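The region sizes in the layout follow directly from the addresses. A small sketch of the arithmetic, where `region_size_tb` and the two `*_ADDR` macros are hypothetical names introduced only for this check:

```c
#include <stdint.h>

/* Addresses quoted on the slide. */
#define VMALLOC_START_ADDR 0xFFFFC90000000000ULL
#define VMALLOC_END_ADDR   0xFFFFE8FFFFFFFFFFULL

/* Size in TB of the region [start, end_inclusive]; 1TB = 2^40 bytes. */
uint64_t region_size_tb(uint64_t start, uint64_t end_inclusive)
{
    return (end_inclusive - start + 1) >> 40;
}
```

`region_size_tb(VMALLOC_START_ADDR, VMALLOC_END_ADDR)` gives 32, matching the 32TB vmalloc/ioremap range, and the user-space range 0 to 0x0000_7FFF_FFFF_FFFF gives 128.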
vmalloc - call path
vmalloc
  __vmalloc_node
    __vmalloc_node_range (Range: VMALLOC_START-VMALLOC_END)
      __get_vm_area_node
        kzalloc_node: allocate a vm_struct from kmalloc (slub allocator)
        alloc_vmap_area
          1. Allocate a vmap_area struct from kmem_cache (slub allocator)
          2. Get a virtual address from the vmalloc RB-tree (vmap_area RB-tree)
        setup_vmalloc_vm
      __vmalloc_area_node
        Memory allocation for storing pointers of page descriptors: area->pages[]
        for (i = 0; i < area->nr_pages; i++)
          page = alloc_page(gfp_mask)
          area->pages[i] = page
        map_kernel_range: page table population
Page table is populated immediately upon the request: No page fault
Example: vmalloc size = 8MB: alloc_vmap_area()
[Figure: vmalloc-test.ko requests vmalloc: 8MB from the vmalloc subsystem, which calls alloc_pages() in the buddy system]
__vmalloc_node_range
  __get_vm_area_node
    kzalloc_node: allocate a vm_struct from kmalloc (slub allocator)
    alloc_vmap_area: allocate a vmap_area struct from kmem_cache (slub allocator), then get a virtual address from the vmalloc RB-tree (vmap_area RB-tree)
      find_vmap_lowest_match(): get a VA from the RB-tree free_vmap_area_root (init by vmalloc_init())
      insert_vmap_area(): vmap_area_root, list_head vmap_area_list (vmap_area, vmap_area, vmap_area)
    setup_vmalloc_vm
  __vmalloc_area_node
vmap_area
  va_start = 0xffffc90001a4d000
  va_end = 0xffffc9000224e000
  rb_node
  list
  union: subtree_max_size, vm
Example: vmalloc size = 8MB: setup_vmalloc_vm()
[Figure: same vmap_area and call path as in the alloc_vmap_area() example; the vmap_area's vm field refers to the vm_struct below]
vm_struct
  next
  addr = 0xffffc90001a4d000
  size = 0x801000 (w/ guard page)
  flags = 0x22
  **pages = NULL
  nr_pages = 0
  phys_addr
  caller
Example: vmalloc size = 8MB: __vmalloc_area_node()
[Figure: same vmap_area as in the alloc_vmap_area() example]
__vmalloc_area_node
  Memory allocation for storing pointers of page descriptors: area->pages[]
  for (i = 0; i < area->nr_pages; i++)
    page = alloc_page(gfp_mask)
    area->pages[i] = page
  map_kernel_range: page table population
vm_struct
  next
  addr = 0xffffc90001a4d000
  size = 0x801000 (w/ guard page)
  flags = 0x22
  **pages = 0xffffc900019b9000 (array of pointers to page descriptors)
  nr_pages = 0x800 (2048)
  phys_addr
  caller
Memory allocation for page descriptor pointers
• Size: 8MB/4KB * 8 = 16384 bytes
• Allocated from vmalloc (> 4KB) or kmalloc (<= 4KB)
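The numbers in the 8MB example can be reproduced with a few lines of arithmetic. This is a user-space sketch under the assumptions stated on the slides (4KB pages, 8-byte pointers, one guard page); `struct vmalloc_plan` and `plan_vmalloc` are hypothetical names, not kernel APIs.

```c
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096ULL
#define MODEL_PTR_SIZE  8ULL    /* pointer size on x86_64 */

/* Hypothetical summary of what a vmalloc request would need. */
struct vmalloc_plan {
    uint64_t area_size;          /* virtual range, including the guard page */
    uint64_t nr_pages;           /* physical pages taken from the buddy system */
    uint64_t array_bytes;        /* size of the pages[] pointer array */
    int array_from_vmalloc;      /* 1 if the array itself exceeds one page */
};

struct vmalloc_plan plan_vmalloc(uint64_t size)
{
    struct vmalloc_plan p;
    /* Round the request up to whole pages. */
    uint64_t aligned = (size + MODEL_PAGE_SIZE - 1) & ~(MODEL_PAGE_SIZE - 1);

    p.area_size = aligned + MODEL_PAGE_SIZE;       /* add one guard page */
    p.nr_pages = aligned / MODEL_PAGE_SIZE;
    p.array_bytes = p.nr_pages * MODEL_PTR_SIZE;
    p.array_from_vmalloc = p.array_bytes > MODEL_PAGE_SIZE;
    return p;
}
```

For `plan_vmalloc(8ULL << 20)` the model gives an area size of 0x801000 (equal to va_end - va_start above), 0x800 pages, and a 16384-byte pointer array, which is larger than 4KB and therefore would itself come from vmalloc.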
vmalloc users/scenario
• Array size > PAGE_SIZE (4KB)
  o arr[0], arr[1] … arr[n] → Needs (virtually) contiguous memory for array indexing
  o Example: 8MB memory allocation (for page descriptors) from vmalloc
    ▪ The page descriptor list (vm_struct->pages) requires contiguous memory for array indexing
[Figure: vm_struct from the 8MB example (addr = 0xffffc90001a4d000, size = 0x801000 w/ guard page, flags = 0x22, **pages = 0xffffc900019b9000, nr_pages = 0x800 (2048)); **pages refers to the array of page descriptor pointers]
Memory allocation for page descriptor pointers
• Size: 8MB/4KB * 8 = 16384 bytes
• Allocated from vmalloc (> 4KB)
vmalloc users/scenario
• Virtually-mapped stack (VMAP_STACK=y)
  o Uses a virtually-mapped stack with a guard page: a kernel stack overflow can be detected immediately.
[Figure: clone() system call]