Noisy Neighbor Detection with eBPF by Jose Fernandez
ScyllaDB
22 slides
Oct 16, 2024
About This Presentation
Tackling "noisy neighbor" issues in multi-tenant setups! At Netflix, we use eBPF to monitor and mitigate excessive CPU usage in real-time. Learn how we instrument the Linux scheduler, optimize eBPF, and maintain high performance. Get actionable insights for your infrastructure. #DevOps #eBPF
Slide Content
A ScyllaDB Community
Noisy Neighbor Detection
with eBPF
Jose Fernandez
Senior Software Engineer
Jose Fernandez (he/him)
Senior Software Engineer at Netflix
■@Netflix: Cloud Infrastructure, Compute team
■Specialize in observability, performance, and
efficiency
■I created and maintain bpftop
■Outside of work, I enjoy spending time with family,
hiking in the CO Rocky Mountains, playing
pickleball, and gaming.
The noisy neighbor problem
■Netflix Context: On Titus, our multi-tenant compute platform, a "noisy
neighbor" refers to a container or system service that heavily utilizes server
resources, causing performance degradation in adjacent containers.
■CPU utilization is our primary focus due to its frequent role in noisy neighbor
issues.
■Leads to degraded user experience and support burden for infrastructure
teams.
The blame game
Traditional detection methods
■Limitations of Traditional Tools:
●Tools like perf can introduce significant overhead.
●Risk further performance degradation.
■Reactive Deployment:
●Typically used after issues occur—too late for effective investigation.
■Expertise Barrier:
●Debugging requires deep low-level expertise and specialized tools.
●This isn't scalable or efficient for rapid problem-solving.
Requirements for a solution
■Continuous, Real-Time Instrumentation
■Minimal Performance Impact
■Deep Kernel-Level Visibility
■Handles Netflix-scale
■Accessible to Non-Experts
eBPF as the Solution
■Continuous, Real-Time Instrumentation
●eBPF allows us to develop programmable monitoring solutions tailored to our needs.
■Minimal Performance Impact
●eBPF programs execute within the kernel and are highly optimized, so they introduce minimal overhead.
■Deep Kernel-Level Visibility
●eBPF provides access to low-level scheduling events and kernel data structures.
■Handles Netflix-scale
●eBPF scales efficiently across our multi-tenant environment, handling monitoring at the scale we require.
■Accessible to Non-Experts
●eBPF allows us to emit metrics that power dashboards in an understandable format.
container1’s 99th percentile runq.latency averages 83µs (microseconds), with spikes up to 400µs
Launching container2 at 10:35, which maxes out all CPUs on the host, caused a 131-millisecond spike
in container1’s P99 run queue latency
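Run queue latency is the time a task spends runnable but waiting for a CPU. The sketch below is a plain Python model, not the eBPF programs from the talk: it pairs a hypothetical stream of "wakeup" and "switch_in" events per PID, then takes a nearest-rank P99 over the resulting latencies. The event tuple format and function names are assumptions made for illustration.

```python
import math


def runq_latencies(events):
    """Pair wakeup and switch-in events per PID and return latencies in ns.

    `events` is an iterable of (timestamp_ns, kind, pid) tuples, where kind
    is "wakeup" or "switch_in". This format is hypothetical.
    """
    enqueued = {}   # pid -> timestamp at which the task became runnable
    latencies = []
    for ts, kind, pid in events:
        if kind == "wakeup":
            enqueued[pid] = ts
        elif kind == "switch_in":
            start = enqueued.pop(pid, None)
            if start is not None:       # ignore switch-ins with no wakeup seen
                latencies.append(ts - start)
    return latencies


def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


events = [
    (1_000, "wakeup", 7),
    (1_500, "switch_in", 7),    # waited 500 ns
    (2_000, "wakeup", 8),
    (2_300, "switch_in", 8),    # waited 300 ns
    (2_400, "switch_in", 9),    # no matching wakeup, skipped
]
latencies = runq_latencies(events)
p99 = percentile(latencies, 99)
```

A production pipeline would more likely aggregate into histograms than sort raw samples; sorting is used here only to keep the percentile definition explicit.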
Improving eBPF stats calculation
■Identified improvement opportunity in eBPF stats calculation
■Patch included in Linux kernel 6.10 release
■Stops the reported program run time from including some of the instrumentation's own overhead
■https://tinyurl.com/ebpf-stats-fix
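When BPF stats are enabled (the kernel.bpf_stats_enabled sysctl), the kernel keeps cumulative run_time_ns and run_cnt counters per eBPF program. A minimal sketch of how a tool could turn two samples of those counters into per-invocation cost and CPU share follows. The dict-based sample format and the function name are assumptions for illustration, not bpftop's actual code.

```python
def summarize_prog_stats(prev, curr, interval_ns):
    """Derive rates from two samples of a program's cumulative counters.

    `prev` and `curr` are dicts with "run_time_ns" and "run_cnt" keys
    (a hypothetical sample format); `interval_ns` is the wall-clock time
    between the two samples.
    """
    delta_cnt = curr["run_cnt"] - prev["run_cnt"]
    delta_ns = curr["run_time_ns"] - prev["run_time_ns"]
    return {
        # average time spent in the program per invocation
        "avg_runtime_ns": delta_ns / delta_cnt if delta_cnt > 0 else 0.0,
        # invocations per second over the sampling interval
        "events_per_sec": delta_cnt * 1_000_000_000 / interval_ns,
        # share of one CPU consumed by the program over the interval
        "cpu_percent": 100 * delta_ns / interval_ns,
    }


prev = {"run_cnt": 1_000, "run_time_ns": 200_000}
curr = {"run_cnt": 3_000, "run_time_ns": 800_000}
stats = summarize_prog_stats(prev, curr, interval_ns=1_000_000_000)
```

Because these figures are derived directly from run_time_ns, any instrumentation overhead folded into that counter inflates the apparent cost of every program.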
Optimizing eBPF code
■BPF_MAP_TYPE_HASH: Most performant for enqueued timestamps.
■BPF_MAP_TYPE_TASK_STORAGE: Nearly twofold performance decline.
■BPF_MAP_TYPE_PERCPU_HASH: Slightly less performant; reason unclear.
■BPF_CORE_READ Helper: Adds 20-30 ns. Direct access for BTF-enabled
tracepoints recommended.
■BPF_MAP_TYPE_LRU_HASH: 40-50 ns slower per operation than a regular hash map.
Adjusted the map size to address space concerns.
■Kernel Tasks (PID 0): Avoided costly operations with early exits and
conditional logic.
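Two of the points above, a fixed-size hash map for enqueue timestamps and early exits for PID 0, can be modeled in a few lines. This is a Python stand-in rather than eBPF C: BoundedHashMap and RunqTracker are hypothetical names, and the map only imitates the behavior of a hash map with a fixed maximum number of entries, where inserting a new key fails once the map is full. The ops counter exists only to show that the early exit skips all map work.

```python
class BoundedHashMap:
    """Toy fixed-capacity map: inserting a new key fails when the map is full."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._data = {}
        self.ops = 0                      # count of map operations performed

    def update(self, key, value):
        self.ops += 1
        if key not in self._data and len(self._data) >= self.max_entries:
            return False                  # full: caller decides what to do
        self._data[key] = value
        return True

    def pop(self, key):
        self.ops += 1
        return self._data.pop(key, None)


class RunqTracker:
    """Tracks enqueue timestamps and records run queue latency per task."""

    def __init__(self, max_entries):
        self.enqueued = BoundedHashMap(max_entries)
        self.latencies_ns = []
        self.dropped = 0                  # wakeups lost because the map was full

    def on_wakeup(self, pid, now_ns):
        if pid == 0:                      # early exit: no map work for PID 0
            return
        if not self.enqueued.update(pid, now_ns):
            self.dropped += 1

    def on_switch_in(self, pid, now_ns):
        if pid == 0:                      # early exit: no map work for PID 0
            return
        start = self.enqueued.pop(pid)
        if start is not None:
            self.latencies_ns.append(now_ns - start)


tracker = RunqTracker(max_entries=2)
tracker.on_wakeup(0, 100)                 # skipped before touching the map
tracker.on_switch_in(0, 150)              # skipped before touching the map
ops_after_pid0 = tracker.enqueued.ops
tracker.on_wakeup(1, 200)
tracker.on_wakeup(2, 210)
tracker.on_wakeup(3, 220)                 # map is full, so this one is dropped
tracker.on_switch_in(1, 260)              # records 60 ns
tracker.on_switch_in(3, 300)              # no timestamp stored, nothing recorded
```

The trade-off the model exposes: a plain bounded map is cheap per operation but drops new entries when full, so its size has to be chosen with PID churn in mind, whereas an LRU map would evict instead at a higher per-operation cost.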
Conclusion
●eBPF proved invaluable for scheduler instrumentation and monitoring.
●Recognized the importance of tools like bpftop for optimizing eBPF
code.
●Expect more infrastructure observability and business logic to shift to
eBPF.
○sched_ext: Potential to revolutionize scheduling decisions tailored to
workload needs.
Thank you! Let’s connect.
Jose Fernandez
josef@netflix.com
@jrfernandez
jrfernandez.com
tinyurl.com/noisy-neighbor-detection