Always-on Profiling of All Linux Threads, On-CPU and Off-CPU, with eBPF & Context Enrichment

ScyllaDB · 30 slides · Jul 01, 2024

About This Presentation

In this session, Tanel introduces a new open source eBPF tool for efficiently sampling both on-CPU and off-CPU events for every thread (task) in the OS. Standard Linux performance tools (like perf) let you easily profile threads doing work on CPU, but if we want to include the off-CPU time of threads that are sleeping or waiting, a different approach is needed.


Slide Content

Always-on Profiling of Linux Threads, On-CPU and Off-CPU, with eBPF and Context Enrichment
Tanel Põder, Consultant & Performance Geek, PoderC LLC

Tanel Põder
- A long-time computer performance geek & consultant
- Built low-tech tools for OS process/thread & DB connection-level performance measurement
  - P99 latency? People out there still use systemwide utilization for monitoring & troubleshooting!
- Built enterprise startups too, with some success
- Still a computer geek even when not working :-)

Concepts & Motivation

Systematic performance troubleshooting
For a systematic, deterministic troubleshooting drilldown, you need to:
- Avoid guesswork: Measure -> Understand -> Fix
- App/service request latency measurement is just the 1st step. But then what? Why the high latency in a database, webserver, app?
- How to drill down into app thread, DB connection-level CPU/wait time, with OS kernel visibility too?
- You cannot extract a single request/thread/connection's metrics from systemwide averages:
  - From OS systemwide utilization averages (sar, vmstat) to a specific thread?!
  - From DB-wide utilization & wait metrics (total CPU usage of a shared SQL statement) to a single execution?
- Today I'll focus entirely on this

System-level metrics vs. thread state sampling: let's sample thread states!

How to sample what threads are doing? (Linux) Options:
- Attach with ptrace() / pstack / gdb – not practical in production
  - Slows things down, can cause process crashes due to signaling complexity & overhead
- Read /proc/PID/task/TID entries – works well on Linux
  - No instrumentation overhead, as the Linux kernel has to update its internal state anyway
  - Limited by what your current kernel exposes via procfs
- Dynamic tracing – eBPF works, is usable and widely available*
  - *Except without root access or on old RHEL 6/7 kernels in enterprise systems
  - Instrument & measure anything – no need to wait for an app vendor or maintainer
First I'll show you the "old" tools... then the new eBPF prototype using bpftrace

Sampling thread states via /proc

/proc/1984

What can /proc sampling tools give you?
0x.tools is a suite of low-tech Linux performance troubleshooting tools
- https://0x.tools -> https://github.com/tanelpoder/0xtools
- Open source (GPL-2.0-or-later)
Tools:
- psn – Python tool for flexible real-time thread state sampling & reporting
- xcapture – lightweight & simple C program for sampling & saving /proc to CSV
- xcapture.bt – the PoC prototype of sampling thread states with eBPF (bpftrace)
- ...
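To give a rough idea of what these tools read under the hood, here is a minimal Python sketch of my own (not actual 0x.tools code; file layouts per the Linux proc(5) man page) that prints one thread's state, wait channel and current syscall number:

    #!/usr/bin/env python3
    # Minimal sketch (not 0x.tools code): read one thread's state from /proc.
    # Field layout follows the Linux proc(5) man page. /proc/PID/task/TID/syscall
    # is only readable by the same user or root, and needs a reasonably new kernel.
    import sys

    def thread_state(pid, tid):
        base = f"/proc/{pid}/task/{tid}"
        with open(f"{base}/stat") as f:
            stat = f.read()
        # comm is inside parentheses (and may contain spaces); state follows ')'
        comm  = stat[stat.index("(") + 1 : stat.rindex(")")]
        state = stat[stat.rindex(")") + 1 :].split()[0]    # R, S, D, ...
        with open(f"{base}/wchan") as f:
            wchan = f.read().strip() or "-"                # kernel wait channel
        with open(f"{base}/syscall") as f:
            syscall = f.read().split()[0]                  # syscall nr, or "running"
        return comm, state, wchan, syscall

    if __name__ == "__main__":
        pid = sys.argv[1]
        tid = sys.argv[2] if len(sys.argv) > 2 else pid
        print(thread_state(pid, tid))

Because the kernel maintains these files anyway, sampling them adds no instrumentation overhead to the target process itself.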

psn – default output mode (like top, but it also shows wait/sleep activity)

Sample all threads of "sync|kworker" procs, group time spent by syscall, wchan

Measure threads of a single process, group activity also by syscall, filename

Can I have always-on /proc sampling?
- psn is meant for interactive troubleshooting of currently ongoing problems
  - psn samples current /proc entries for a few seconds and immediately shows the report
- xcapture samples /proc and writes the output to STDOUT or hourly CSV files
  - https://0x.tools/images/xcapture-example.svg
  - This allows you to "time-travel" back into the past and troubleshoot with thread-level granularity
  - The simple CSV output format allows you to use any tool of your choice for analyzing the data
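To make the always-on idea concrete, here is a simplified Python sketch in the spirit of xcapture, but not the actual tool and with CSV columns of my own choosing: once per second it walks /proc and appends a row for every thread found running (R) or in uninterruptible sleep (D).

    #!/usr/bin/env python3
    # Simplified always-on /proc sampler in the spirit of xcapture (not the real tool).
    # Once per second, append a CSV row for every thread in R (running) or D
    # (uninterruptible sleep) state. Column names here are illustrative only.
    import csv, glob, time, datetime

    def sample_threads():
        for stat_path in glob.glob("/proc/[0-9]*/task/[0-9]*/stat"):
            try:
                with open(stat_path) as f:
                    stat = f.read()
                with open(stat_path[:-4] + "wchan") as f:
                    wchan = f.read().strip() or "-"
            except OSError:
                continue                  # thread exited between listing and reading
            comm  = stat[stat.index("(") + 1 : stat.rindex(")")]
            state = stat[stat.rindex(")") + 1 :].split()[0]
            pid   = stat_path.split("/")[2]
            tid   = stat_path.split("/")[4]
            if state in ("R", "D"):
                yield pid, tid, comm, state, wchan

    if __name__ == "__main__":
        with open("thread_samples.csv", "a", newline="") as out:
            w = csv.writer(out)
            while True:
                ts = datetime.datetime.now().isoformat(timespec="seconds")
                for row in sample_threads():
                    w.writerow([ts, *row])
                out.flush()
                time.sleep(1)

The real xcapture captures more fields (and rotates hourly files), but the shape of the output is the same: timestamped per-thread samples you can query later with any tool.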

Sample all threads from /proc every second, print out threads in R & D state

Sample all threads, including the ones in Sleeping state, print more fields & kstack

Sampling thread states with eBPF

Can we get the same (and more) with eBPF? Yes!!!
- We will not be tracing every single event to output
  - Unrealistic amount of output & high instrumentation overhead
- We will not be sampling only on-CPU threads
  - The profile probe samples only on-CPU threads (as do commands like perf top, by default)
  - We will additionally use the finish_task_switch kprobe for thread sleep (off-CPU) analysis
- We will "trace" the latest thread state changes into a custom array
  - And "clients" then periodically sample the thread state array & consume the output

Populating & sampling the thread state "array"

Populating & sampling the thread state "array"

[Diagram built up over several slides: per-thread state is kept in BPF hash maps keyed by tid (e.g. BPF_HASH(syscall_id), BPF_HASH(syscall_ustack)); as time passes, each thread's map entry is overwritten with its latest activity.]

tracepoint:raw_syscalls:sys_enter { @syscall_id[tid] = args->id; }

We are not tracing, logging or appending all events – we update (overwrite) the current, latest action in custom state arrays.

A separate, independent program samples the state arrays into userspace at its desired frequency, with its own filter rules:

interval:hz:1 { print(@SAMPLE_TIME); print(@syscall_id); }

The sampler can be an eBPF program (bpftrace, bcc, libbpf) or a userspace agent that reads the maps' pseudofiles.
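As a sketch of this populate-and-sample pattern outside bpftrace (one of the directions the TODO slide mentions), here is a minimal BCC Python program of my own, not xcapture.bt itself: the eBPF side overwrites a per-TID hash map entry on every syscall entry, and the Python process acts as the userspace sampler that reads the map once per second.

    #!/usr/bin/env python3
    # Minimal BCC sketch (my own, not xcapture.bt) of the populate-and-sample pattern:
    # the eBPF side overwrites a per-TID map entry on every syscall entry, and this
    # userspace process acts as the sampler that reads the map once per second.
    # Needs root and the bcc Python bindings.
    from bcc import BPF
    from bcc.syscall import syscall_name
    import time

    prog = r"""
    BPF_HASH(syscall_id, u32, u64);               // latest syscall nr, keyed by TID

    TRACEPOINT_PROBE(raw_syscalls, sys_enter) {
        u32 tid = (u32)bpf_get_current_pid_tgid();    // lower 32 bits = thread id
        u64 id  = args->id;
        syscall_id.update(&tid, &id);             // overwrite, don't append
        return 0;
    }
    """

    b = BPF(text=prog)
    print("Sampling latest syscall per thread, Ctrl-C to stop")
    while True:
        time.sleep(1)
        ts = time.strftime("%H:%M:%S")
        for tid, sysid in b["syscall_id"].items():
            print(f"{ts} tid={tid.value:<7} syscall={syscall_name(sysid.value).decode()}")

The design point is the same as in the slides: the probes only keep the latest state per thread, and the sampling frequency (and any filtering) is decided entirely by the consumer, not by the probes.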

Demo

Demo!

Demo! (No need to read this)

TODO

TODO
This is a PoC prototype script, not a production-ready tool or a product
- Rewrite using bcc or libbpf for flexibility
  - Should be able to use a single "map-of-structs" or "map-of-maps" indexed by TID (see the sketch after this list)
- Add more "custom context" from various kprobes (network connections!) and uprobes/USDTs
- State array initialization on xcapture startup
  - Many threads have been sleeping and have not hit any tracepoints that populate the state
- Lots of performance & reliability testing!
There's a lot to do -> help appreciated!
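To illustrate the "single map-of-structs indexed by TID" item above, here is a hedged BCC sketch with struct and field names of my own; each probe would update only the fields it owns in the shared per-thread entry.

    #!/usr/bin/env python3
    # Sketch of the "map-of-structs indexed by TID" idea from the TODO list above.
    # One BPF hash map holds a struct per thread; different probes update different
    # fields of the same entry. Struct/field names are illustrative, not xcapture's.
    # Needs root and a reasonably recent bcc (for lookup_or_try_init).
    from bcc import BPF
    import time

    prog = r"""
    struct thread_state {
        u64 syscall_id;     // latest syscall number
        u64 enter_ns;       // timestamp of the latest syscall entry
        // future: file name, peer address, app-level context from uprobes/USDT...
    };

    BPF_HASH(state, u32, struct thread_state);

    TRACEPOINT_PROBE(raw_syscalls, sys_enter) {
        u32 tid = (u32)bpf_get_current_pid_tgid();
        struct thread_state zero = {}, *ts;
        ts = state.lookup_or_try_init(&tid, &zero);
        if (!ts)
            return 0;
        ts->syscall_id = args->id;       // update only the fields this probe owns
        ts->enter_ns   = bpf_ktime_get_ns();
        return 0;
    }
    """

    b = BPF(text=prog)
    while True:
        time.sleep(1)
        for tid, st in b["state"].items():
            print(f"tid={tid.value:<7} syscall_id={st.syscall_id:<4} entered_ns={st.enter_ns}")

Compared with one BPF_HASH per attribute, a single struct keeps each thread's context together, so the sampler can read one consistent entry per TID.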

Links & resources
- 0x.tools: https://0x.tools
- Processes as Files (1984): https://lucasvr.gobolinux.org/etc/Killian84-Procfs-USENIX.pdf
- Profiling Linux Activity for Performance and Troubleshooting (/proc): https://youtu.be/YEWp3O7Kem8
- More videos by me: https://tanelpoder.com/videos/

Thank you! Let’s connect.
Tanel Põder · tanel@tanelpoder.com · @tanelpoder · tanelpoder.com