VM Performance: The Difference Between Static Partitioning and Automatic Tuning

ScyllaDB, 26 slides, Jun 25, 2024

About This Presentation

Virtualized workloads are known to require carefully crafted configuration and tuning, both at the host and at the guest level, if good performance is to be achieved. But, considering that this comes at the price of reduced flexibility and efficiency, and of more complex management, is it really worth it?


Slide Content

VMs Performance: Static Partitioning or Automatic Tuning. Dario Faggioli, Virtualization Software Engineer, SUSE

Dario Faggioli (he/him), Virtualization Software Engineer at SUSE. Ph.D. @ ReTiS Lab; soft real-time systems, co-authored SCHED_DEADLINE. Interested in “all things performance”, especially virtualization (both evaluation & tuning). @ SUSE: works on KVM & QEMU (downstream & upstream). Travelling, playing with the kids, RPGs, reading.

KVM Tuning: making VMs (how many?) “GO FAST” (for which definition of “FAST”?)
- Transparent / 2MB / 1GB huge pages and memory pinning: memory for the VM is allocated using a specific page size and on a specific host NUMA node
- Virtual CPU (vCPU) pinning, emulator thread pinning, IO thread pinning: vCPU/IO/QEMU threads will only run on a specific subset of the host’s physical CPUs (pCPUs)
- Virtual topology: the VM’s vCPUs are arranged in cores, threads, etc. Check, e.g.: “Virtual Topology for Virtual Machines: Friend or Foe?”
- Exposure/availability of host CPU features: the VM will use TSC as its clocksource, etc.
- Optimized spinlocks, vCPU yielding and idling: disabling PV-spinlocks and PLE, using cpuidle-haltpoll, etc.
Check, e.g.: “No Slower than 10%!”
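
As a rough idea of how several of these knobs translate into a libvirt domain definition, here is a minimal sketch combining most of them. The pCPU numbers, the single IO thread and the 1 GiB page size are illustrative assumptions, not the talk’s actual settings:

  <!-- back the guest RAM with 1 GiB huge pages -->
  <memoryBacking>
    <hugepages>
      <page size='1048576' unit='KiB'/>
    </hugepages>
  </memoryBacking>
  <!-- allocate (and keep) the guest memory on host NUMA node 0 -->
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>
  <!-- restrict vCPU, emulator and IO threads to specific pCPUs -->
  <vcpu placement='static'>4</vcpu>
  <iothreads>1</iothreads>
  <cputune>
    <vcpupin vcpu='0' cpuset='1'/>
    <vcpupin vcpu='1' cpuset='17'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='18'/>
    <emulatorpin cpuset='0,16'/>
    <iothreadpin iothread='1' cpuset='0,16'/>
  </cputune>
  <!-- expose the host CPU features and define an explicit virtual topology -->
  <cpu mode='host-passthrough' check='none'>
    <topology sockets='1' dies='1' cores='2' threads='2'/>
  </cpu>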

Tuning VMs: “It’s Complicated”
Tuning at the hypervisor / libvirt level: necessary to know all the details about the host, and to specify all the properties. For example:
  <vcpu placement='static'>4</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='1'/>
    <vcpupin vcpu='1' cpuset='17'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='18'/>
  </cputune>
  <cpu mode='host-passthrough' check='none'>
    <topology sockets='1' dies='1' cores='2' threads='2'/>
  </cpu>
Tuning at the “middleware” (e.g., Kubevirt) level: simpler to specify (for the user), but difficult to implement correctly (see, e.g., “Kubevirt and the Cost of Containerizing VMs”). For example:
  spec:
    domain:
      cpu:
        cores: 2
        threads: 2
        dedicatedCpuPlacement: true
(diagram: the VM’s vCPUs v0-v3 pinned onto specific host pCPUs)

Tuning VMs is complex. Is it worth it?

Experimental Setup: Hardware
  CPU(s): 384
  Model name: Intel(R) Xeon(R) Platinum 8468H
  Thread(s) per core: 2
  Core(s) per socket: 48
  Socket(s): 4
  Caches (sum of all): L1d: 9 MiB (192 instances), L1i: 6 MiB (192 instances), L2: 384 MiB (192 instances), L3: 420 MiB (4 instances)
  NUMA node(s): 4
  RAM: 2.0 TiB (512 GiB per node)

Experimental Setup: VMs
Virtual machines: 4 vCPUs, 12 GB RAM (each). Multiple scenarios, varying the number of VMs: 1, 2, 4, 8, 16, 32, 64, 92, 96, 144, 192.
NB: 96 VMs == 24 VMs per NUMA node == 384 vCPUs out of 384 pCPUs == 96 vCPUs out of 96 pCPUs per NUMA node.
Load: < 96 VMs: underload; == 96 VMs: at capacity; > 96 VMs: overload.

Experimental Setup: Benchmarks
- sysbench-cpu: purely CPU intensive (compute the first N primes), multi-threaded (1, 2, 4, 6 threads); 4 threads “saturate” the VMs’ vCPUs
- sysbench OLTP: database workload (PostgreSQL, large memory footprint), multi-threaded (1, 2, 4, 6 threads)
- cyclictest: wakeup latency (of 4 “timer threads”)
- cyclictest + kernbench: wakeup latency (of 4 “timer threads”), with kernbench running in the background (in each VM) to add noise

Tuning: Evaluated Configurations

Tuning: Default. Basically, no tuning! No pinning (neither CPU nor memory), no virtual topology (i.e., we use the default one), AutoNUMA enabled. vCPUs can run everywhere! (diagram: vCPUs v0-v3 free to run on any host pCPU)
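
For contrast, a minimal sketch (not from the talk) of what the “Default” case amounts to in the domain XML, i.e. what is simply left out; the 4 vCPUs and 12 GB of RAM match the experimental setup:

  <vcpu placement='static'>4</vcpu>
  <memory unit='GiB'>12</memory>
  <!-- no <cputune>, no <numatune>, no explicit <topology>:
       the host scheduler and AutoNUMA place vCPUs and memory freely -->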

Tuning: Pin vCPUs to NODE. Only vCPUs pinned, to a full NUMA node: all the vCPUs of a VM are pinned to all the pCPUs of one specific NUMA node. No virtual topology, AutoNUMA enabled. vCPUs can run everywhere on a specific node. (diagram: v0-v3 restricted to the pCPUs of one NUMA node)
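
A minimal libvirt sketch of this “relaxed” pinning, assuming (purely for illustration) that NUMA node 0 exposes pCPUs 0-47 plus their SMT siblings 192-239; the real IDs must be read from the host (e.g. with numactl -H):

  <cputune>
    <!-- every vCPU is allowed on any pCPU of NUMA node 0 (assumed IDs) -->
    <vcpupin vcpu='0' cpuset='0-47,192-239'/>
    <vcpupin vcpu='1' cpuset='0-47,192-239'/>
    <vcpupin vcpu='2' cpuset='0-47,192-239'/>
    <vcpupin vcpu='3' cpuset='0-47,192-239'/>
  </cputune>
  <!-- no <numatune>: memory placement is still left to AutoNUMA -->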

Tuning: Pin Mem to NODE. Only memory pinned, to a NUMA node (of course): no vCPU pinning, no virtual topology; memory is allocated on, and pinned to, a specific NUMA node. AutoNUMA enabled. vCPUs can run everywhere! (diagram: memory bound to one node, vCPUs free to run on any pCPU)
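
In libvirt terms this is just a <numatune> element, sketched here with node 0 as an arbitrary example:

  <numatune>
    <!-- guest RAM allocated on, and kept on, host NUMA node 0; vCPUs stay unpinned -->
    <memory mode='strict' nodeset='0'/>
  </numatune>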

Tuning: Pin vCPUs to Core. Only vCPUs pinned, to a physical core: all the vCPUs of a VM are pinned to all the pCPUs (SMT threads) of a specific core. No virtual topology, AutoNUMA enabled. vCPUs can run on the pCPUs of a physical core. (diagram: v0-v3 restricted to the two threads of one core)
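
Sketched in libvirt XML, assuming (for illustration only) that pCPUs 1 and 193 are the two SMT threads of the chosen core; the actual sibling pairs come from lscpu -e:

  <cputune>
    <!-- all 4 vCPUs share the two SMT threads of one physical core (assumed pair) -->
    <vcpupin vcpu='0' cpuset='1,193'/>
    <vcpupin vcpu='1' cpuset='1,193'/>
    <vcpupin vcpu='2' cpuset='1,193'/>
    <vcpupin vcpu='3' cpuset='1,193'/>
  </cputune>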

Tuning: Pin vCPUs to Core + Mem to NODE. vCPUs are pinned to a core, memory to the NUMA node: all the vCPUs of a VM are pinned to all the pCPUs of a specific core, and memory is allocated on, and pinned to, the node where that core is. No virtual topology, AutoNUMA disabled. (diagram: v0-v3 on one core, memory on that core’s node)
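
Combining the two previous sketches, with the same illustrative assumptions (core threads 1,193 sitting on NUMA node 0):

  <cputune>
    <vcpupin vcpu='0' cpuset='1,193'/>
    <vcpupin vcpu='1' cpuset='1,193'/>
    <vcpupin vcpu='2' cpuset='1,193'/>
    <vcpupin vcpu='3' cpuset='1,193'/>
  </cputune>
  <numatune>
    <!-- memory on the same node as the pinned core -->
    <memory mode='strict' nodeset='0'/>
  </numatune>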

Tuning: Pin vCPUs 1to1 + Topology. Basically, all the tuning: vCPUs are pinned 1-to-1 to physical threads, according to the virtual topology; memory is allocated on, and pinned to, the node where those threads are; a virtual topology is defined (2 cores, 2 threads). All tuning applied, AutoNUMA disabled. (diagram: each vCPU on its own host thread)
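
A libvirt sketch of the fully tuned case, again assuming that 1/193 and 2/194 are two SMT sibling pairs on NUMA node 0 (the exact IDs depend on the host enumeration):

  <vcpu placement='static'>4</vcpu>
  <cputune>
    <!-- one vCPU per host thread; vCPUs 0/1 and 2/3 map to the two threads
         of two cores, matching the virtual topology below -->
    <vcpupin vcpu='0' cpuset='1'/>
    <vcpupin vcpu='1' cpuset='193'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='194'/>
  </cputune>
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>
  <cpu mode='host-passthrough' check='none'>
    <!-- virtual topology matching the pinning: 1 socket, 2 cores, 2 threads -->
    <topology sockets='1' dies='1' cores='2' threads='2'/>
  </cpu>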

Experimental Results

Sysbench CPU (AVG), 1-32 VMs: all cases look very similar; pinning means earlier in-VM “saturation”.

Sysbench CPU (AVG), 64-192 VMs: when load increases, the performance gap becomes smaller.

Sysbench CPU (AVG), per-VM: for raw CPU performance, pinning & tuning bring little improvement, especially if the host is not oversubscribed.

Sysbench CPU (STDDEV): pinning improves consistency (but not always! :-O); when pinning, a (matching) virtual topology also helps.

Sysbench OLTP (AVG): Default and “Relaxed Pinning” FTW!?! When pinning, a (matching) virtual topology is quite important.

Sysbench OLTP (STDDEV): pinning greatly improves consistency, especially when the load inside the VMs is high; when pinning, a (matching) virtual topology is very important.

Cyclictest: pinning guarantees the best average latency, but pinning alone is not enough for achieving good worst-case latency (see how BLUE and RED beat GREEN and ORANGE); pinning plus a (matching) virtual topology is necessary.

Cyclictest + KernBench (noise): when in [over]load, pinning results in worse average latency; “Default” (BLUE) FTW!! Pinning still gives the best worst-case latencies, especially with a matching topology; “Default” (BLUE) and “Relaxed Pinning” (RED) are, in this case, both worse (although not that far from GREEN and ORANGE).

Conclusions: Tuning VMs is complex. Is it worth it? It depends :-| (on load, workload(s), metrics, ...): check with benchmarks! Keep digging: more combinations of virtual topologies & (relaxed) pinning, and mixed pinned/unpinned configurations. (diagram: some VMs pinned to dedicated pCPUs, others pinned outside of the dedicated pCPUs)

Thank you! Let’s connect. Dario Faggioli: dfaggioli@suse.com, @DarioFaggioli, about.me