Primitive Pursuits: Slaying Latency with Low-Level Primitives and Instructions

ScyllaDB · Oct 17, 2024

About This Presentation

This talk showcases a methodology, with examples, for breaking applications down into low-level primitives and identifying optimizations on existing compute instances and platforms, or for offloading specific portions of the application to accelerators or GPUs. With the increasing use of a combination of CPU...


Slide Content

A ScyllaDB Community
Primitive Pursuits: Slaying Latency with
Low-Level Primitives & Instructions
Ravi A Giri
Senior Principal Engineer
Harshad Sane
Principal Software Engineer

Agenda
■[ 3 mins ] Computing Primitives: An Introduction

■[ 5 mins ] Real World Examples: Optimizing Primitives for Improved Latencies

■[ 5 mins ] “Full Stack Performance Engineering” – Targeting CPU ISAs and Accelerators

■[ 5 mins ] How: Methodology, Tools and Resources

■[ 2 mins ] Future Work and Ideas

Workload (broken down into primitives):
■Compute
•Kernel: spin-lock, synchronization, virtualization
•User (security): checksum, crypto, compression/decompression
•Dynamic runtime: just-in-time compilation, garbage collection, VM, Python
•Other: regex/parsing, math libraries
■Memory: memory movement, synchronization
■Network: async IO, routing/load balancing, RPC
■Disk: read/writes, logging

Primitive: a base-level function inherent in an application that impacts its performance

Multiple Use Cases & Deployment Models, Common Primitives
Examples of primitives: data move, encrypt, compress, encode/decode, string, sort/split, math, allocators, GC
If a function consumes 1000 vCPUs for 1 year, a 1% perf improvement is worth 10 vCPU-years*!
*Thomas Dullien, ‘Adventures in Performance’, QCon March ‘23
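The vCPU-years arithmetic above can be checked directly (a minimal sketch; the 1000-vCPU fleet size is the slide's illustrative assumption):

```python
# Savings from a 1% perf improvement on a function that
# consumes 1000 vCPUs continuously for one year.
fleet_vcpus = 1000      # slide's illustrative assumption
improvement = 0.01      # 1% reduction in CPU time
saved_vcpu_years = fleet_vcpus * improvement
print(saved_vcpu_years)  # 10.0 vCPU-years
```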
•Developer velocity and time to market are focused on the top layers

•Common primitives and libraries consume significant amounts of CPU time

•The generic version of a library is built to run on the ‘lowest common denominator’ platform

•Specialized CPU ISAs, GPUs & accelerators are not yet major considerations for ‘offload’
[Diagram: the same stack (Servers, Storage, Networking, Virtualization, Operating System, Middleware & Frameworks, Data, Applications/App Functions) repeated across deployment models: On Premises, IaaS, PaaS, FaaS, SaaS]

Some examples…
Search:
■Performance degradation in a customer’s search service during migration from JDK8 to JDK11
■Switched GC from CMS to G1; recommended GC parameters improved the p95 latency by ~6.5%
■In the production env., a 1% CPU reduction helped save ~$100K
[Chart: stop-the-world GC rate]
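The deck does not list the exact G1 parameters used; a typical starting point for this kind of CMS-to-G1 migration looks like the fragment below (illustrative flags, not the customer's actual configuration; the jar name is hypothetical):

```shell
# Illustrative G1 settings for a JDK11 service migrated off CMS;
# values must be tuned against the workload's own pause-time goals.
java -XX:+UseG1GC \
     -XX:MaxGCPauseMillis=200 \
     -XX:InitiatingHeapOccupancyPercent=45 \
     -XX:G1HeapRegionSize=16m \
     -jar search-service.jar   # hypothetical application jar
```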
Storage:
•280 trillion objects and an average of 100M requests/sec
•125 billion event notifications to serverless apps
•Replication moves >100PB of data per week
•Every ‘put’ is spread out to as many disks as possible
•Hundreds of thousands of customers’ data spread across >1M drives
•>4 billion checksum computations/second; checksum optimizations delivering >3X higher performance!
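To see why the checksum primitive is worth optimizing at this scale, compare a naive bit-at-a-time CRC-32 with an optimized implementation. This sketch uses Python's `zlib.crc32` (a tuned C implementation) as a stand-in for vectorized libraries like ISA-L; same answer, vastly different cost per byte:

```python
import zlib

def crc32_bitwise(data: bytes, crc: int = 0) -> int:
    """Naive bit-at-a-time CRC-32 (IEEE polynomial, reflected form)."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right; XOR in the polynomial when the low bit was set.
            crc = (crc >> 1) ^ (0xEDB88320 * (crc & 1))
    return crc ^ 0xFFFFFFFF

data = b"primitive pursuits" * 1000
# The optimized implementation must stay bit-exact with the naive one.
assert crc32_bitwise(data) == zlib.crc32(data)
```

Vectorized CRC (e.g. AVX-512 carry-less multiply paths in ISA-L) applies the same polynomial arithmetic to many bytes per instruction, which is where the >3X figure on the slide comes from.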

[ placeholder for additional examples ]
■RDS pg_vector – ISA usage (AVX-512 with popcnt, simdsort, checksum), accelerator usage (QAT for ~30% faster backup time)

■OpenSearch – indexing and search perf improvement with AVX-512 (VBMI2), QAT (index and dataset compression acceleration)

■Example: crc32 improvements through AVX-512 and ISA-L – this gave the customer a 10X speedup in the time the CPU spent in the CRC portion of the app

■Java regex – improvements through reuse of precompiled patterns

■Large memory transfer operations >8K transparently offloaded to the DSA accelerator (DTO library)
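The precompiled-pattern item above refers to Java's `java.util.regex`; the same principle, shown here in Python for brevity, is to compile the pattern once and reuse it on the hot path (the log-line format is made up for illustration; note Python's `re` module caches compiled patterns internally, so the gap is smaller than in Java, where `Pattern.compile` in a loop is pure waste):

```python
import re

LOG_LINE = "2024-10-17 12:00:01 INFO request latency_ms=42"

# Anti-pattern: the pattern string is re-parsed (or at best looked up
# in re's internal cache) on every call.
def parse_naive(line: str):
    return re.search(r"latency_ms=(\d+)", line)

# Compiled once at module load, reused on every call.
LATENCY_RE = re.compile(r"latency_ms=(\d+)")

def parse_precompiled(line: str):
    return LATENCY_RE.search(line)

assert parse_naive(LOG_LINE).group(1) == "42"
assert parse_precompiled(LOG_LINE).group(1) == "42"
```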

Break Down: Application -> Primitives -> Optimization Targets
Workload to primitives example – Kafka: PubSub

% Use    | Primitives          | Potential Optimizations
---------|---------------------|------------------------
20-55%   | IO stack (network)  | Java/nio
15-30%   | kernel              | syscall
2-20%    | security (crypto)   | SSL, SHA, CRC
0-7%     | decompression       | gzip
0-7%     | regex/parsing       | java Matcher/regex
2-6%     | garbage collection  | GC
Optimizing Primitives [Example] – Encryption:
Workload: OpenSSL speed benchmark (ECDH X25519)

•Default/unoptimized: baseline
•Optimized with ISA: AVX-512 ‘Crypto NI’
•Optimized with accelerator: Intel® QAT

OpenSSL 3.0.5 and later has support for Crypto-NI.

https://github.com/intel/QAT_Engine
OpenSSL pluggable engine; can use either QAT or Crypto-NI
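The benchmark in question can be reproduced with the `openssl speed` subcommand; the commands below are a sketch (the engine name follows the QAT_Engine project's conventions and assumes the engine is installed on a machine with QAT hardware):

```shell
# Baseline: default OpenSSL implementation.
openssl speed ecdhx25519

# With the QAT engine loaded (requires QAT_Engine and QAT hardware).
openssl speed -engine qatengine ecdhx25519
```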

Evaluating applications for offload to CPU ISAs & GPUs

Dependent operations:
•Specialized CPU ISAs: CPUs can handle dependent operations efficiently due to cache hierarchies
•Discrete GPUs and accelerators: perform best with tasks having low inter-thread communication and low dependency across operations

Memory access patterns:
•Specialized CPU ISAs: flexible handling of irregular memory access patterns
•Discrete GPUs and accelerators: require more regular, structured memory access patterns

Data granularity:
•Specialized CPU ISAs: effective at handling operations at a smaller granularity (e.g. processing streams of data)
•Discrete GPUs and accelerators: effective on larger datasets due to the overhead involved in dispatching work, setting up the execution, and data transfer to/from device memory

Offloading to CPU ISAs and GPUs:
Data services workload functions can be:
■Front-end intensive (query optimization, planning & distribution)
Execution pattern resembles compilers; a good fit for the CPU
■Back-end intensive (operator execution)
Complex queries can benefit from GPU offload
■Storage intensive (efficient, reliable, distributed)
Specialized CPU ISAs can optimize numerous functions (data movement, compression, encryption, erasure coding, etc.)
Some operators are an order of magnitude faster on GPU with NVIDIA RAPIDS cuDF

•AVX-512 based compression optimizations available in Intel® ISA-L, DPDK, zlib (e.g. the chromium fork), etc.

•QAT compression supports zlib (QATzip) & zstd (qat-zstd)
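Even without ISA-L or QAT, the throughput-versus-ratio tradeoff these optimizations target is visible with stock zlib levels; a minimal sketch (pure illustration, not the accelerated numbers):

```python
import time
import zlib

data = b"scylladb low-latency primitives " * 4096  # compressible sample

for level in (1, 6, 9):  # fast, default, best-compression
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    # Round-trip check: an optimized primitive must stay correct.
    assert zlib.decompress(compressed) == data
    print(f"level={level} "
          f"ratio={len(data) / len(compressed):.1f} "
          f"time={elapsed * 1e3:.2f}ms")
```

Accelerated implementations (AVX-512 paths, QAT offload) shift this curve: better throughput at the same ratio, which is what QATzip and qat-zstd expose behind the standard zlib/zstd APIs.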

[WIP] Methodology for breaking down apps to primitives…
1. Generate flamegraph
2. Analyze flamegraph and generate breakdown

Event monitoring:
•Hardware events at system or core level
•Core frequency, CPI (cycles/instruction), cache, TLB, memory, power, TMA (Top-down Microarchitecture Analysis)

•OS events
•CPU-clock based flamegraphs
•Interrupts, page faults, context switches, disk and network IO

•Challenges with GPUs & accelerators
•Harder to understand effective utilization
•Symbolization is a hard problem, getting addressed now
•Vendor/product differences in metrics (e.g. ‘effective FLOPS’)

Methodology: continuous profiling, not one-off, with the ability to switch to fine-grained collection when required

•Tools: gProfiler, PerfSpect, etc.
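Step 2 above, turning a flamegraph into a primitive breakdown, can be sketched over the "folded stacks" format that flamegraph tooling emits (`frame;frame;frame count` per line). The keyword-to-primitive mapping here is illustrative, not an official taxonomy:

```python
from collections import Counter

# Map substrings of frame names to primitive categories (illustrative).
PRIMITIVE_KEYWORDS = {
    "crc": "checksum",
    "ssl": "crypto",
    "inflate": "decompression",
    "regexec": "regex/parsing",
    "memcpy": "memory movement",
    "tcp_": "network IO",
}

def primitive_breakdown(folded_lines):
    """Aggregate folded-stack samples into per-primitive fractions."""
    totals = Counter()
    grand_total = 0
    for line in folded_lines:
        stack, count = line.rsplit(" ", 1)
        count = int(count)
        grand_total += count
        stack_lower = stack.lower()
        category = "other"
        for needle, name in PRIMITIVE_KEYWORDS.items():
            if needle in stack_lower:
                category = name
                break
        totals[category] += count
    return {cat: cnt / grand_total for cat, cnt in totals.items()}

samples = [
    "main;process;crc32_le 40",
    "main;process;SSL_write 30",
    "main;loop;memcpy 20",
    "main;idle 10",
]
print(primitive_breakdown(samples))
# {'checksum': 0.4, 'crypto': 0.3, 'memory movement': 0.2, 'other': 0.1}
```

In practice the mapping would be driven by symbol tables and library names rather than substrings, and fed by continuous profiling output (e.g. gProfiler) rather than a hand-written sample list.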

Future Work…
■Auto-detect and dispatch primitives – at the kernel or VM level?

■Auto-resolve/handle dependencies between primitives and conditions

■Monitor CPU, memory and device queues to maintain load balancing

■Ideas include JIT-based inlining, library/optimized primitive injection, PGO, etc.

■Chuck benchmarks, switch to continuous profiling!