Linux - Printk should be your last resort

steam20 38 views 47 slides Sep 14, 2025

About This Presentation

Better methods for debugging the Linux kernel: ftrace, eBPF


Slide Content

printk should be your last resort
Subbaraya Sundeep
Principal Engineer at Marvell
MARVELL

Agenda
•Motivation
•ftrace
•tracepoints
•fprobes
•kprobes
•eBPF

Motivation
•My local workflow is simple: add printk(), recompile the kernel, transfer
the image to the board, and test.
•Customer build environments are often complex and tightly controlled.
•The person reporting the bug may need to coordinate with their internal
build team to generate a new image.
•We often work across different time zones, increasing round-trip times
for communication and patch testing.
•This slows down debugging and delays root cause analysis.
•The kernel provides tracing techniques that can be used to get
information from a running system.

ftrace
•ftrace is a powerful dynamic tracing framework built into the Linux
kernel.
•ftrace is where modifying a running kernel began.
•Tracing is controlled via a special filesystem called tracefs.
•Trace data is written to an internal ring buffer.
•Ideal for analyzing performance, call paths, and runtime behavior.
•Lightweight and safe for use in production environments.

ftrace – function graph tracer
•mount -t debugfs none /sys/kernel/debug; cd /sys/kernel/debug/tracing/

ftrace – function graph tracer
•Understand the kernel flow using the function graph tracer
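As a rough sketch of the function graph tracer workflow, the helpers below write to the standard tracefs control files (`current_tracer`, `set_graph_function`, `tracing_on`, `trace`). The helper names and the `TRACEFS` variable are illustrative and not taken from the slides; the path is assumed to be `/sys/kernel/tracing`, or `/sys/kernel/debug/tracing` when tracefs is reached through debugfs.

```shell
# Location of tracefs; override TRACEFS if it is mounted elsewhere (assumption).
TRACEFS="${TRACEFS:-/sys/kernel/tracing}"

# graph_trace_start FUNC: trace FUNC and everything it calls.
graph_trace_start() {
    echo 0 > "$TRACEFS/tracing_on"              # pause while reconfiguring
    echo "$1" > "$TRACEFS/set_graph_function"   # limit the graph to FUNC's call tree
    echo function_graph > "$TRACEFS/current_tracer"
    echo 1 > "$TRACEFS/tracing_on"
}

# graph_trace_stop: stop tracing, print the ring buffer, go back to the nop tracer.
graph_trace_stop() {
    echo 0 > "$TRACEFS/tracing_on"
    cat "$TRACEFS/trace"
    echo nop > "$TRACEFS/current_tracer"
}
```

Run as root: call `graph_trace_start` with the function of interest, reproduce the problem, then call `graph_trace_stop` to read the call graph.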

ftrace – function graph tracer
•With recent kernels, function arguments and return values can also be traced
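Tracer behaviour is toggled through files under `options/` in tracefs. The helper below is a minimal sketch with a hypothetical name; the option names in the comments (`funcgraph-retval`, `funcgraph-args`) are an assumption about what a recent kernel exposes while `function_graph` is the current tracer, so check the `options` directory on the target before relying on them.

```shell
# Location of tracefs; override TRACEFS if it is mounted elsewhere (assumption).
TRACEFS="${TRACEFS:-/sys/kernel/tracing}"

# set_trace_option NAME VALUE: write 0 or 1 to options/NAME.
# Returns non-zero when the running kernel does not expose the option.
set_trace_option() {
    if [ ! -e "$TRACEFS/options/$1" ]; then
        echo "option '$1' is not available on this kernel" >&2
        return 1
    fi
    echo "$2" > "$TRACEFS/options/$1"
}

# Assumed option names (verify with: ls "$TRACEFS/options"):
#   set_trace_option funcgraph-retval 1   # print return values
#   set_trace_option funcgraph-args 1     # print function arguments
```

Checking for the file first turns a missing kernel feature into a clear message instead of a shell redirection error.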

Tracepoints
•Tracepoints are predefined hooks placed in the kernel source code.
•They allow developers to emit structured trace data at specific locations.
•Can be used with tools like perf, trace-cmd, or bpftrace.
•Ideal for observing kernel events like scheduler activity, memory
management, or device drivers.
•Low overhead and safe for production use.
•Almost every kernel subsystem has tracepoints to help with debugging.

Tracepoints
Mailbox regions between VF, PF, and AF: VF2PF, PF2VF, PF2AF, AF2PF
Stage 1
1.Allocate msg in the HW shared mbox region
2.Send msg to PF by triggering an interrupt to PF
3.Wait for response/ack
Stage 2
1.Upon receiving INT from VF, copy the messages to the PF2AF mbox
2.Send msg to AF by triggering an interrupt to AF
3.Wait for response/ack
Stage 3
1.Upon receiving INT from PF, process the message
2.Send response msg to PF by triggering an interrupt to PF
Stage 4
1.Upon receiving INT from AF, copy the response messages to the PF2VF mbox
2.Send response msg to VF by triggering an interrupt to VF
Stage 5
1.Upon receiving INT from PF, check ACK/responses.

Tracepoints – example
•Example of a tracepoint and its format
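Every tracepoint appears under `events/<subsystem>/<event>/` in tracefs, with a `format` file describing its fields and an `enable` file to switch it on. The sketch below uses hypothetical helper names; the `sched/sched_switch` event in the usage note is only a stand-in, since the tracepoint shown on the original slide is not part of this text.

```shell
# Location of tracefs; override TRACEFS if it is mounted elsewhere (assumption).
TRACEFS="${TRACEFS:-/sys/kernel/tracing}"

# tp_format SUBSYSTEM EVENT: print the event's field layout and print format.
tp_format() {
    cat "$TRACEFS/events/$1/$2/format"
}

# tp_enable SUBSYSTEM EVENT [0|1]: switch one tracepoint on (default) or off.
tp_enable() {
    echo "${3:-1}" > "$TRACEFS/events/$1/$2/enable"
}
```

For example, `tp_format sched sched_switch` lists the fields, `tp_enable sched sched_switch` starts recording, and the records can then be read from `$TRACEFS/trace` or `$TRACEFS/trace_pipe`.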


Tracepoints – debug example
Problem
•‘ifconfig eth0 up’ takes longer than expected

Tracepoints – debug example
•Enable workqueue tracepoints to confirm
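The commands behind this step are not in the slide text. One plausible way to do it, sketched with hypothetical helper names, is to enable the whole `workqueue` event subsystem and then look at the `workqueue_execute_start` / `workqueue_execute_end` pairs, whose timestamps bracket each work item.

```shell
# Location of tracefs; override TRACEFS if it is mounted elsewhere (assumption).
TRACEFS="${TRACEFS:-/sys/kernel/tracing}"

# wq_trace_on: enable every event in the workqueue subsystem and start tracing.
wq_trace_on() {
    echo 1 > "$TRACEFS/events/workqueue/enable"
    echo 1 > "$TRACEFS/tracing_on"
}

# wq_trace_dump: disable the events, then show only the execute start/end
# records so long-running work items stand out by their timestamps.
wq_trace_dump() {
    echo 0 > "$TRACEFS/events/workqueue/enable"
    grep -E 'workqueue_execute_(start|end)' "$TRACEFS/trace"
}
```

Call `wq_trace_on`, run the slow command, then call `wq_trace_dump` and compare the start and end timestamps of each work function.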

Fprobes
•fprobes are a newer, more efficient alternative to kprobes, designed to
reduce overhead.
•They allow attaching probes to multiple functions using entry/exit hooks
with minimal performance impact.
•Can be dynamically added and removed at runtime, making them
suitable for live systems.
•When BTF (BPF Type Format) data is available, fprobes can access
function arguments by name, improving readability and ease of use.
•Ideal for tracing large sets of functions with minimal setup.
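fprobe events are defined by appending a line that starts with `f:` to the `dynamic_events` file in tracefs. The following is a minimal sketch: the helper names are hypothetical, and the fetch arguments depend on the kernel (parameter names when BTF is available, otherwise positional `$arg1`, `$arg2`, and so on).

```shell
# Location of tracefs; override TRACEFS if it is mounted elsewhere (assumption).
TRACEFS="${TRACEFS:-/sys/kernel/tracing}"

# fprobe_add EVENT FUNC [FETCHARGS...]: define an fprobe on the entry of FUNC.
fprobe_add() {
    event="$1"; func="$2"; shift 2
    echo "f:$event $func $*" >> "$TRACEFS/dynamic_events"
}

# fprobe_enable EVENT [0|1]: events defined without a group land in "fprobes".
fprobe_enable() {
    echo "${2:-1}" > "$TRACEFS/events/fprobes/$1/enable"
}
```

A call such as `fprobe_add bigalloc <allocator function> order` followed by `fprobe_enable bigalloc` records one trace line per call; the event name `bigalloc` is made up for illustration and the function name is left open because it differs between kernel versions.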

fprobes – debug example
•Problem: kernel warning when bringing up an interface

fprobes – debug example
•Let’s check the WARN_ON in mm/page_alloc.c at line 4935

fprobes – debug example
•Too many page allocations are happening system-wide

fprobes – debug example
•Let’s add a filter to capture only higher-order allocations
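Each trace event has a `filter` file that accepts a boolean expression over the event's fields. A minimal sketch of the filtering step follows; the helper names, the event name `bigalloc`, and the threshold in the usage note are all hypothetical, and the field name `order` assumes the probe was defined with a fetch argument of that name.

```shell
# Location of tracefs; override TRACEFS if it is mounted elsewhere (assumption).
TRACEFS="${TRACEFS:-/sys/kernel/tracing}"

# event_filter GROUP EVENT EXPR: record only the hits for which EXPR is true.
event_filter() {
    echo "$3" > "$TRACEFS/events/$1/$2/filter"
}

# event_filter_clear GROUP EVENT: writing 0 removes the filter again.
event_filter_clear() {
    echo 0 > "$TRACEFS/events/$1/$2/filter"
}
```

For instance, `event_filter fprobes bigalloc 'order > 3'` would drop the flood of small allocations and keep only the larger requests.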

fprobes – debug example
•Check whether it really comes from the interface open call site
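One way to see the call site is the `stacktrace` event trigger, which dumps a kernel stack into the trace each time the event fires. This is a sketch that assumes triggers are usable on the probe event in question; the helper names and the `bigalloc` event in the usage note are hypothetical.

```shell
# Location of tracefs; override TRACEFS if it is mounted elsewhere (assumption).
TRACEFS="${TRACEFS:-/sys/kernel/tracing}"

# event_stacktrace GROUP EVENT [FILTER]: dump a stack trace on every hit,
# or only on hits matching FILTER when one is given.
event_stacktrace() {
    if [ -n "$3" ]; then
        echo "stacktrace if $3" > "$TRACEFS/events/$1/$2/trigger"
    else
        echo stacktrace > "$TRACEFS/events/$1/$2/trigger"
    fi
}

# event_stacktrace_off GROUP EVENT: remove the trigger again.
event_stacktrace_off() {
    echo '!stacktrace' > "$TRACEFS/events/$1/$2/trigger"
}
```

With `event_stacktrace fprobes bigalloc 'order > 3'`, each large allocation in the trace is followed by its call chain, which shows whether the request originates in the driver's open path.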

fprobes – debug example
•dma_alloc_attrs has a tracepoint in it; enable it and check its
parameters

fprobes – debug example
•Repeat the same steps on a working system

fprobes – debug example
•Looking at the code, PAGE_SIZE is the only variable that differs between
the working and non-working cases!

Kprobes
•When tracepoints are missing in the code path, kprobes provide a
flexible way to instrument almost any kernel function or instruction.
•Unlike tracepoints, kprobes can be inserted dynamically at runtime,
without requiring any prior instrumentation in the source.
•Internally, kprobes work by placing a breakpoint instruction at the probe
location, which introduces some overhead.
•Note: The mapping of function arguments to registers or stack locations
depends on the architecture-specific ABI.
•It’s better to use perf probe to simplify probe creation and argument
handling.
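Without perf, kprobe events are created by appending definitions to `kprobe_events` in tracefs. The sketch below uses hypothetical helper names. The fetch arguments are where the ABI dependence shows up: a register such as `%di` (x86-64) or `%x0` (arm64) for the first argument, or `$arg1` on kernels and architectures that support it.

```shell
# Location of tracefs; override TRACEFS if it is mounted elsewhere (assumption).
TRACEFS="${TRACEFS:-/sys/kernel/tracing}"

# kprobe_add EVENT SYMBOL [FETCHARGS...]: probe SYMBOL (or SYMBOL+offset).
kprobe_add() {
    event="$1"; sym="$2"; shift 2
    echo "p:$event $sym $*" >> "$TRACEFS/kprobe_events"
}

# kretprobe_add EVENT SYMBOL: probe the return of SYMBOL and record its value.
kretprobe_add() {
    echo "r:$1 $2 ret=\$retval" >> "$TRACEFS/kprobe_events"
}

# kprobe_enable EVENT [0|1]: events defined without a group land in "kprobes".
kprobe_enable() {
    echo "${2:-1}" > "$TRACEFS/events/kprobes/$1/enable"
}
```

Having to pick the right register by hand for each architecture is the main reason the slide recommends `perf probe`, which resolves arguments from debug information.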

kprobes – perf
•Use perf to simplify adding a probe (vmlinux is also needed)
•After the probe is created, access it via tracefs
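A hedged sketch of these two steps: `perf probe -k <vmlinux> -a '<function> <args>'` creates the probe, and the resulting event is then enabled through tracefs, where perf places it in the `probe` group. The helper names and the `PERF`, `VMLINUX`, and `TRACEFS` variables are illustrative, not from the slides.

```shell
# Tool and file locations; all three defaults are assumptions to adjust.
PERF="${PERF:-perf}"
VMLINUX="${VMLINUX:-./vmlinux}"
TRACEFS="${TRACEFS:-/sys/kernel/tracing}"

# perf_probe_add FUNC [ARGS...]: probe FUNC's entry and fetch the named
# arguments; perf resolves their locations from vmlinux debug info.
perf_probe_add() {
    "$PERF" probe -k "$VMLINUX" -a "$*"
}

# perf_probe_enable EVENT: switch the new event on via tracefs.
perf_probe_enable() {
    echo 1 > "$TRACEFS/events/probe/$1/enable"
}
```

After `perf_probe_add <function> <arg names>` and `perf_probe_enable <function>`, the hits appear in `$TRACEFS/trace` like any other event.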

kprobes – perf
•Beyond a function and its args, we can add a kprobe in the middle of a function
•This helps to check how variables are changing (needs the kernel source!)

kprobes – perf
•Check all the variables that can be probed at our line of interest

kprobes – perf
•Add two probes at two lines with variable names, then enable the probes

kprobes – perf
•Check how the variables change between the probes
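The line-level workflow can be sketched with three `perf probe` modes: `-L` lists a function's source with probe-able line numbers, `-V` lists the variables visible at a given line, and `-a` adds a named probe there. The helper names and the `PERF` / `VMLINUX` variables are hypothetical; line numbers are relative to the start of the function, and perf needs both debug info and the kernel source to show them.

```shell
# Tool and file locations; both defaults are assumptions to adjust.
PERF="${PERF:-perf}"
VMLINUX="${VMLINUX:-./vmlinux}"

# perf_probe_lines FUNC: show FUNC's source with the lines that can be probed.
perf_probe_lines() {
    "$PERF" probe -k "$VMLINUX" -L "$1"
}

# perf_probe_vars FUNC LINE: list the variables available at FUNC:LINE.
perf_probe_vars() {
    "$PERF" probe -k "$VMLINUX" -V "$1:$2"
}

# perf_probe_add_line EVENT FUNC LINE [VARS...]: add a probe named EVENT at
# FUNC:LINE that records the listed variables.
perf_probe_add_line() {
    event="$1"; func="$2"; line="$3"; shift 3
    "$PERF" probe -k "$VMLINUX" -a "$event=$func:$line $*"
}
```

Adding two such probes at different lines with the same variable names, and enabling both, gives a before/after view of those variables in the trace.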

eBPF (Extended Berkeley Packet Filter)
•Until now, tracing tools like ftrace, kprobes, and tracepoints allowed
us to observe kernel internals.
•With eBPF, we can now execute custom logic inside the kernel when
a probe is hit.
•eBPF is a kernel technology that runs sandboxed programs in the Linux
kernel without modifying kernel code or loading modules.
•Programs are compiled to bytecode and executed in a lightweight
eBPF virtual machine inside the kernel.
•The verifier ensures safety by checking for valid memory access, program
termination, and restricted operations.
•Maps are key-value stores used to share data between kernel and
user space or across eBPF programs.
•Programs attach to hook points like tracepoints, kprobes, network
events, etc.

eBPF (contd)
Image courtesy: https://speakerdeck.com/leodido/designing-a-grpc-interface-for-kernel-tracing-with-ebpf?slide=13

eBPF – memleak detector
•Let’s write an eBPF-based memory leak detector in C
•Hook simple eBPF programs into the kmalloc and kfree tracepoints
•Track allocations during any custom "alloc command" (e.g., module
insertion, interface up)
•Verify that the corresponding "free command" (e.g., module removal,
interface down) cleaned up all the memory allocated before
•No need to dive into memory management internals or allocation
paths
•Lightweight and easy to extend

eBPF – kernel program
Kernel
kmalloc {
tracepoint;
return addr;
}
kfree(addr) {
tracepoint;
}
map_kmalloc
Key        Value
ptr1       calltrace1
ptr2       calltrace2
ptr3       calltrace2
•Store memory block addresses and corresponding call traces in a map during kmalloc
•Search with the memory block address as key and, if found, remove the element from the map
during kfree
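The same bookkeeping can be tried quickly with bpftrace before writing the C program. This is a swapped-in illustration, not the author's detector: a map keyed by the allocated pointer stores the kernel stack on `kmalloc`, and the entry is deleted on `kfree`, so whatever remains at exit was never freed. The helper name and the `BPFTRACE` variable are hypothetical, and the script uses the older `args->field` and `delete(@map[key])` spellings; newer bpftrace releases may expect `args.ptr` and `delete(@live, args.ptr)` instead.

```shell
# bpftrace binary to use; override BPFTRACE if it lives elsewhere (assumption).
BPFTRACE="${BPFTRACE:-bpftrace}"

# kmalloc_leak_watch: track live kmalloc'ed blocks system-wide until Ctrl-C.
# On exit bpftrace prints @live: one entry per block that was not freed,
# together with the stack that allocated it.
kmalloc_leak_watch() {
    "$BPFTRACE" -e '
        tracepoint:kmem:kmalloc { @live[args->ptr] = kstack; }
        tracepoint:kmem:kfree   { delete(@live[args->ptr]); }
    '
}
```

Start `kmalloc_leak_watch` as root, run the alloc and free commands in another terminal, then interrupt it. Because the tracking is system-wide, unrelated allocations also show up in the output.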

eBPF – kernel program (maps)
map_kmalloc (BPF_MAP_TYPE_HASH)
Key        Value
ptr1       stackid1
ptr2       stackid2
ptr3       stackid2
smap_kmalloc (BPF_MAP_TYPE_STACK_TRACE)
Key        Value
stackid1   __kmalloc
           ext4_htree_store_dirent
           htree_dirblock_to_tree
           ext4_htree_fill_tree
           ext4_readdir
           iterate_dir
stackid2   __kmalloc_node_track_caller
           kmalloc_reserve
           __alloc_skb
           __napi_alloc_skb
           napi_get_frags
•To get the call trace inside an eBPF program, use the bpf_get_stackid() helper
•The helper requires a map of type BPF_MAP_TYPE_STACK_TRACE as an argument and returns a unique stack id for
the call trace
•The key of the stack map is the stack id and the value is the array of function addresses (the call trace) that led to kmalloc


eBPF – kernel program
Problem:
The free/teardown sequence also calls kmalloc

eBPF – kernel program (maps)
map_config (BPF_MAP_TYPE_ARRAY)
Key        Value
0          flags (MAP_DO_KMALLOC)
map_kmalloc (BPF_MAP_TYPE_HASH)
Key        Value
ptr1       stackid1
ptr2       stackid2
ptr3       stackid2
smap_kmalloc (BPF_MAP_TYPE_STACK_TRACE)
Key        Value
stackid1   __kmalloc
           ext4_htree_store_dirent
           htree_dirblock_to_tree
           ext4_htree_fill_tree
           ext4_readdir
           iterate_dir
stackid2   __kmalloc_node_track_caller
           kmalloc_reserve
           __alloc_skb
           __napi_alloc_skb
           napi_get_frags
•Use a map of type BPF_MAP_TYPE_ARRAY as a flag to control the eBPF program from userspace

eBPF – kernel program
•Let's add another map which acts as a flag to tell the kernel when to track allocations, and set it from the userspace program.
•The userspace program now sets the flag -> system(alloc_cmd) -> clears the flag -> system(free_cmd)

eBPF – kernel program
Improvement
•When kmalloc is called in a loop in a driver, the entire output is
filled with stack traces from the same call site.
•Let's add another map where the key is the call trace/stack id and the value
is a counter that gets incremented
•Our tool output then becomes clearer: the count shows the number of
allocations that happened at the same call trace

eBPF – kernel program (maps)
smap_count (BPF_MAP_TYPE_HASH)
Key        Value
stackid1   1
stackid2   2
map_config (BPF_MAP_TYPE_ARRAY)
Key        Value
0          flags (MAP_DO_KMALLOC)
map_kmalloc (BPF_MAP_TYPE_HASH)
Key        Value
ptr1       stackid1
ptr2       stackid2
ptr3       stackid2
smap_kmalloc (BPF_MAP_TYPE_STACK_TRACE)
Key        Value
stackid1   __kmalloc
           ext4_htree_store_dirent
           htree_dirblock_to_tree
           ext4_htree_fill_tree
           ext4_readdir
           iterate_dir
stackid2   __kmalloc_node_track_caller
           kmalloc_reserve
           __alloc_skb
           __napi_alloc_skb
           napi_get_frags
•Count identical call traces using another map, smap_count, of type BPF_MAP_TYPE_HASH

eBPF – kernel program
•The smap_count map counts identical call traces

eBPF – kernel program
Improvement
•Output is somewhat nicer after displaying counts of identical call traces
instead of listing call traces one by one
•Many allocations happen system-wide, in addition to my driver's
allocations, in the window between the alloc and free commands
•So the user space program was enhanced with more post-processing: it can take a
text file with the function names of my driver and display only the leaks
related to my driver
•Use C code browsing tools to capture all function names of a
driver/folder
•ctags -x --c-types=f drivers/net/ethernet/marvell/octeontx2/nic/* | cut -f1 -d" " > test.txt
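The post-processing idea can be sketched with awk, independent of how the detector itself prints its results. The report layout here is an assumption: call traces separated by blank lines, one frame per line, optionally in `symbol+0xoff/0xlen` form. The function name `filter_leaks` is hypothetical.

```shell
# filter_leaks FUNCS_FILE REPORT_FILE
# Print only those call traces in REPORT_FILE that contain at least one
# function listed in FUNCS_FILE (one name per line, as ctags | cut produces).
filter_leaks() {
    awk '
        # First file: remember every driver function name.
        NR == FNR { funcs[$1] = 1; next }
        # Second file is read in paragraph mode (RS set to empty on the command
        # line below), so each record is one blank-line separated call trace.
        {
            for (i = 1; i <= NF; i++) {
                name = $i
                sub(/\+.*$/, "", name)      # strip "+0x1c/0x40" style offsets
                if (name in funcs) { print; print ""; next }
            }
        }
    ' "$1" RS= "$2"
}
```

With the function list in `test.txt`, `filter_leaks test.txt report.txt` keeps the driver's traces and drops the unrelated system-wide allocations.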

eBPF – kernel program
Did it work now?
•Let's leak some memory in the netdevsim driver and check

eBPF – kernel program
Yes!

eBPF – bpftrace
•bpftrace is a high-level tracing language for Linux.
•Provides a quick and easy way for people to write observability-based
eBPF programs, especially those unfamiliar with the
complexities of eBPF.
•Uses LLVM as a backend to compile scripts to eBPF bytecode
•Makes use of libbpf and bcc for interacting with the Linux BPF
subsystem, as well as existing Linux tracing capabilities: kernel dynamic
tracing (kprobes), user-level dynamic tracing (uprobes),
tracepoints, etc.
•The bpftrace language is inspired by awk and C
•Easy to install on distros like Ubuntu, Red Hat, etc.

eBPF – bpftrace example
•Find the number of tagged packets sent out from all interfaces
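The script from the slide is not part of this text, so the following is only one possible reading, assuming "tagged" means VLAN-tagged. It hooks the `net:net_dev_start_xmit` tracepoint, which carries a `vlan_tagged` field and the device name, and counts matching packets per interface. The helper name and the `BPFTRACE` variable are hypothetical, and newer bpftrace releases may prefer `args.vlan_tagged` over `args->vlan_tagged`.

```shell
# bpftrace binary to use; override BPFTRACE if it lives elsewhere (assumption).
BPFTRACE="${BPFTRACE:-bpftrace}"

# count_vlan_tx: count transmitted packets that carry a VLAN tag in the skb
# metadata, grouped by interface name. Counts are printed on Ctrl-C.
count_vlan_tx() {
    "$BPFTRACE" -e '
        tracepoint:net:net_dev_start_xmit /args->vlan_tagged/ {
            @tagged[str(args->name)] = count();
        }
    '
}
```

The whole tool is a predicate plus a counting map, which is the kind of brevity that makes bpftrace attractive for quick questions like this one.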

References
•Kernel docs
•samples/bpf in the kernel source
•https://github.com/bpftrace/bpftrace
•Brendan Gregg's blogs and videos
•https://github.com/Subbaraya-Sundeep/memleak_detector_ebpf

Questions?