VM Performance: The Differences Between Static Partitioning and Automatic Tuning
ScyllaDB
Jun 25, 2024
About This Presentation
Virtualized workloads are known to require carefully crafted configuration and tuning, both at the host and at the guest level, if good performance is to be achieved. But, considering that this comes at the price of reduced flexibility and efficiency, and of more complex management, is it really worth it?
On the other hand, modern operating systems offer mechanisms for handling complex workloads dynamically and efficiently, without the need for cumbersome manual setup. But is the performance good enough?
This talk tries to provide some possible answers to the above questions by showing the results of an extensive analysis of different combinations and levels of automatic and manual tuning of a virtualization server, under some of the most typical workloads and load conditions.
Dario Faggioli (he/him), Virtualization Software Engineer at SUSE
- Ph.D. @ ReTiS Lab: soft real-time systems, co-authored SCHED_DEADLINE
- Interested in "all things performance", especially about virtualization (both evaluation & tuning)
- @ SUSE: working on KVM & QEMU (downstream & upstream)
- Travelling, playing with the kids, RPGs, reading
KVM Tuning: Making VMs (how many?) "GO FAST" (for which definition of "FAST"?)
- Transparent / 2MB / 1GB huge pages, memory pinning: memory for the VM is allocated using a specific page size and on a specific host NUMA node
- Virtual CPU (vCPU) pinning, emulator threads pinning, IO threads pinning: vCPU/IO/QEMU threads will only run on a specific subset of the host's physical CPUs (pCPUs)
- Optimized spinlocks, vCPU yielding and idling: disabling PV-spinlocks and PLE, using cpuidle-haltpoll, etc. Check, e.g.: "No Slower than 10%!"
- Virtual topology, exposure/availability of host CPU features: the VM's vCPUs are arranged in cores, threads, etc., and the VM uses TSC as its clocksource, etc. Check, e.g.: "Virtual Topology for Virtual Machines: Friend or Foe?"
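For illustration, several of these knobs map directly to libvirt domain XML. The fragment below is only a sketch, not the exact configuration used in this talk: the 1G page size and the pvspinlock setting are example choices. It requests huge-page backing for guest memory, passes the host CPU model through to the guest, and turns off paravirtualized spinlocks.
  <!-- back guest RAM with 1G huge pages (page size is an example choice) -->
  <memoryBacking>
    <hugepages>
      <page size='1' unit='G'/>
    </hugepages>
  </memoryBacking>
  <!-- expose the host CPU model/features to the guest -->
  <cpu mode='host-passthrough' check='none'/>
  <!-- disable paravirtualized spinlocks inside the guest -->
  <features>
    <pvspinlock state='off'/>
  </features>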
Tuning VMs: "It's Complicated"
Tuning at the hypervisor/libvirt level:
- Necessary to know all the details about the host
- Necessary to specify all the properties
Tuning at the "middleware" (e.g., KubeVirt) level:
- Simpler to specify (for the user)
- Difficult to implement correctly (see, e.g., "KubeVirt and the Cost of Containerizing VMs")
[Diagram: VM vCPUs (v0-v3) mapped onto host pCPUs (p0-p5)]
Libvirt:
  <vcpu placement='static'>4</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='1'/>
    <vcpupin vcpu='1' cpuset='17'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='18'/>
  </cputune>
  <cpu mode='host-passthrough' check='none'>
    <topology sockets='1' dies='1' cores='2' threads='2'/>
  </cpu>
KubeVirt:
  spec:
    domain:
      cpu:
        cores: 2
        threads: 2
        dedicatedCpuPlacement: true
Tuning VMs is complex. Is it worth it?
Experimental Setup: Hardware
- CPU(s): 384
- Model name: Intel(R) Xeon(R) Platinum 8468H
- Thread(s) per core: 2
- Core(s) per socket: 48
- Socket(s): 4
- Caches (sum of all): L1d: 9 MiB (192 instances), L1i: 6 MiB (192 instances), L2: 384 MiB (192 instances), L3: 420 MiB (4 instances)
- NUMA node(s): 4
- RAM: 2.0 TiB (512 GiB per node)
Experimental Setup: VMs
- Virtual Machines: 4 vCPUs, 12 GB RAM (each)
- Multiple scenarios, number of VMs: 1, 2, 4, 8, 16, 32, 64, 92, 96, 144, 192
- NB: 96 VMs == 24 VMs per NUMA node == 384 vCPUs out of 384 pCPUs == 96 vCPUs out of 96 pCPUs per NUMA node
- Load: < 96 VMs: underload; == 96 VMs: at capacity; > 96 VMs: overload
Experimental Setup: Benchmarks
- Sysbench CPU: purely CPU intensive (compute first N primes); multi-threaded (1, 2, 4, 6 threads); 4 threads "saturate" the VMs' vCPUs
- Sysbench OLTP: database workload (PostgreSQL, large memory footprint); multi-threaded (1, 2, 4, 6 threads)
- Cyclictest: wakeup latency (of 4 "timer threads")
- Cyclictest + KernBench: wakeup latency (of 4 "timer threads"); KernBench running in the background (in each VM) to add noise
Tuning: Evaluated Configurations
Tuning: Default
Basically, no tuning!
- No pinning (neither CPU nor memory)
- No virtual topology (i.e., we use the default one)
- AutoNUMA enabled
[Diagram: VM vCPUs (v0-v3) vs. host pCPUs] vCPUs can run everywhere!
Tuning: Pin vCPUs to NODE
Only vCPUs pinned, to a full NUMA node:
- All the vCPUs of a VM are pinned to all the pCPUs of one specific NUMA node
- No virtual topology
- AutoNUMA enabled
[Diagram: VM vCPUs (v0-v3) vs. host pCPUs] vCPUs can run everywhere on a specific NODE
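A minimal libvirt sketch of this configuration, assuming a 4-vCPU VM and assuming the chosen NUMA node's pCPUs are enumerated as 0-95 (the real IDs must be taken from lscpu or virsh capabilities on the actual host): every vCPU is allowed to run on any pCPU of that node.
  <cputune>
    <!-- each vCPU may float over the whole pCPU range of one NUMA node -->
    <vcpupin vcpu='0' cpuset='0-95'/>
    <vcpupin vcpu='1' cpuset='0-95'/>
    <vcpupin vcpu='2' cpuset='0-95'/>
    <vcpupin vcpu='3' cpuset='0-95'/>
  </cputune>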
Tuning: Pin Mem to NODE
Only memory pinned, to a NUMA node (of course):
- No vCPU pinning
- No virtual topology
- Memory is allocated on and pinned to a specific NUMA node
- AutoNUMA enabled
[Diagram: VM vCPUs (v0-v3) vs. host pCPUs] vCPUs can run everywhere!
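A possible libvirt sketch for memory-only pinning uses numatune; nodeset='0' is a placeholder for whichever host NUMA node is chosen, and no cputune section is present, so the vCPUs remain free to roam.
  <numatune>
    <!-- allocate and keep all guest memory on host NUMA node 0 (placeholder) -->
    <memory mode='strict' nodeset='0'/>
  </numatune>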
Tuning: Pin vCPUs to Core
Only vCPUs pinned, to a physical core:
- All the vCPUs of a VM are pinned to all the pCPUs of a specific core
- No virtual topology
- AutoNUMA enabled
[Diagram: VM vCPUs (v0-v3) vs. host pCPUs] vCPUs can run on the pCPUs of a physical core
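A sketch of the corresponding libvirt cputune, assuming (hypothetically) that pCPUs 2 and 194 are the two SMT threads of the chosen physical core; the actual sibling pairs depend on the host's CPU enumeration, e.g., as reported in /sys/devices/system/cpu/cpuN/topology/thread_siblings_list.
  <cputune>
    <!-- all four vCPUs share the two SMT threads of one physical core (hypothetical IDs) -->
    <vcpupin vcpu='0' cpuset='2,194'/>
    <vcpupin vcpu='1' cpuset='2,194'/>
    <vcpupin vcpu='2' cpuset='2,194'/>
    <vcpupin vcpu='3' cpuset='2,194'/>
  </cputune>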
Tuning: Pin vCPUs to Core + Mem to NODE
vCPUs are pinned to a core, memory to the NUMA node:
- All the vCPUs of a VM are pinned to all the pCPUs of a specific core
- Memory is allocated and pinned to the node where that core is
- No virtual topology
- AutoNUMA disabled
[Diagram: VM vCPUs (v0-v3) vs. host pCPUs]
Tuning: Pin vCPUs 1to1 + Topology
Basically, all the tuning:
- vCPUs are pinned 1-to-1 to physical threads, according to the virtual topology
- Memory is allocated and pinned to the node where those cores are
- Virtual topology is defined (2 cores, 2 threads)
- All tuning applied, AutoNUMA disabled
[Diagram: VM vCPUs (v0-v3) vs. host pCPUs]
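Putting the pieces together, a hedged sketch of the fully tuned domain XML, combining the 1-to-1 vcpupin and topology fragment shown earlier with memory pinning; the pCPU pairs 2/194 and 3/195 and nodeset='0' are placeholders for real SMT siblings and the node they belong to. AutoNUMA would be disabled separately on the host (e.g., via the kernel.numa_balancing sysctl).
  <vcpu placement='static'>4</vcpu>
  <cputune>
    <!-- vCPUs 0/1 on the SMT siblings of one core, vCPUs 2/3 on another (placeholder IDs) -->
    <vcpupin vcpu='0' cpuset='2'/>
    <vcpupin vcpu='1' cpuset='194'/>
    <vcpupin vcpu='2' cpuset='3'/>
    <vcpupin vcpu='3' cpuset='195'/>
  </cputune>
  <numatune>
    <!-- guest memory stays on the node hosting those cores -->
    <memory mode='strict' nodeset='0'/>
  </numatune>
  <cpu mode='host-passthrough' check='none'>
    <!-- virtual topology matching the pinning: 2 cores x 2 threads -->
    <topology sockets='1' dies='1' cores='2' threads='2'/>
  </cpu>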
Experimental Results
Sysbench CPU (AVG), 1 - 32 VMs
- All cases look very similar
- Pinning means earlier in-VM "saturation"
Sysbench CPU (AVG), 64 - 192 VMs
- When load increases, the performance gap becomes smaller
Sysbench CPU (AVG), per-VM
Raw CPU performance:
- Pinning & tuning bring few improvements
- Especially if the host is not oversubscribed
Sysbench CPU (STDDEV)
- Pinning improves consistency (but not always! :-O)
- When pinning, a (matching) virtual topology also helps
Sysbench OLTP (AVG)
- Default and "Relaxed Pinning" FTW!?!
- When pinning, a (matching) virtual topology is quite important
Sysbench OLTP (STDDEV)
- Pinning greatly improves consistency, especially when the load inside the VMs is high
- When pinning, a (matching) virtual topology is very important
Cyclictest
- Pinning guarantees the best average latency
- Pinning alone is not enough for achieving good worst-case latency: see how BLUE and RED beat GREEN and ORANGE
- Pinning plus a (matching) virtual topology is necessary
Cyclictest + KernBench (noise)
- When in [over]load, pinning results in worse average latency: "Default" (BLUE) FTW!!
- Pinning still gives the best worst-case latencies, especially with a matching topology; "Default" (BLUE) and "Relaxed Pinning" (RED) are, in this case, both worse (although not that far from GREEN and ORANGE)
Conclusions
Tuning VMs is complex. Is it worth it? It depends :-|
- On load, workload(s), metrics, … Check with benchmarks!
Keep digging:
- More combinations of virtual topologies & (relaxed) pinning
- Mixed pinned/unpinned configurations
[Diagram: some VMs pinned & dedicated to pCPUs, other VMs pinned outside of the dedicated pCPUs]