Primitive Pursuits: Slaying Latency with Low-Level Primitives and Instructions
ScyllaDB
Oct 17, 2024
About This Presentation
This talk showcases a methodology, with examples, for breaking applications down into low-level primitives and identifying optimizations on existing compute instances or platforms, or for offloading specific portions of the application to accelerators or GPUs. With the increasing use of a combination of CPUs, GPUs, and accelerators/ASICs, this methodology could prove increasingly useful for evaluating what kind of compute to use, and when.
Slide Content
A ScyllaDB Community
Primitive Pursuits: Slaying Latency with
Low-Level Primitives & Instructions
Ravi A Giri
Senior Principal Engineer
Harshad Sane
Principal Software Engineer
Agenda
■[ 3 mins ] Computing Primitives: An Introduction
■[ 5 mins ] Real World Examples: Optimizing Primitives for Improved Latencies
■[ 5 mins ] “Full Stack Performance Engineering” – Targeting CPU ISAs and Accelerators
■[ 5 mins ] How: Methodology, Tools and Resources
■[ 2 mins ] Future Work and Ideas
[Diagram: a workload broken down into its compute primitives, grouped by category]
■Kernel: spin-lock, synchronization, virtualization, user
■Security: checksum, crypto, compression/decompression
■Dynamic runtime: just-in-time compilation, garbage collection, VM, Python
■Other: regex/parsing, math libraries
■Memory: memory movement, synchronization
■Network: async IO, routing, load balancing, RPC
■Disk: reads/writes, logging
Primitive: a base-level function inherent in an application that impacts its performance
Multiple Use Cases & Deployment Models, Common Primitives
Examples of primitives: data move, encrypt, compress, encode/decode, string, sort/split, math, allocators, GC
If a function consumes 1000 vCPUs for 1 year, a 1% perf improvement is worth 10 vCPU-years!*
*Thomas Dullien, ‘Adventures in Performance’, QCon March ‘23
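To spell out the arithmetic behind that claim: a function keeping 1000 vCPUs busy for a full year consumes 1000 vCPU-years of compute, so a 1% improvement frees 1000 × 0.01 = 10 vCPU-years.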
•Developer velocity and time to market are the focus at the top layers
•Common primitives and libraries consume significant amounts of CPU time
•The generic version of a library is built to run on the ‘lowest common denominator’ platform
•Specialized CPU ISAs, GPUs & accelerators are not yet major considerations for ‘offload’
[Diagram: the same stack – Servers, Storage, Networking, Virtualization, Operating System, Middleware & Frameworks, Data, Applications/App Functions – shown across five deployment models: On-Premises, IaaS, PaaS, FaaS, SaaS]
Some examples…
Search:
■Performance degradation in a customer’s search service during migration from JDK 8 to JDK 11
■Switching GC from CMS to G1 and applying recommended GC parameters improved the p95 latency by ~6.5%
■In the production environment, a 1% CPU reduction helped save ~$100K
[Chart: stop-the-world GC rate]
Storage:
•280 trillion objects and an average of 100M requests/sec
•125 billion event notifications to serverless apps
•Replication moves >100PB of data per week
•Every ‘put’ is spread out to as many disks as possible
•100s of 1000s of customers’ data spread across >1M drives
•>4 billion checksum computations/second
Checksum optimizations delivering >3X higher performance!
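To make the checksum primitive above concrete, here is a minimal Java sketch using the JDK's built-in java.util.zip.CRC32. It is illustrative only – the slide refers to vendor-optimized checksum paths (e.g. ISA-L), not this class – though recent HotSpot JVMs do intrinsify CRC32 with CPU CRC instructions where available.

```java
import java.util.zip.CRC32;

public class ChecksumExample {
    public static void main(String[] args) {
        byte[] block = new byte[64 * 1024];       // stand-in for an object chunk
        java.util.Arrays.fill(block, (byte) 7);

        CRC32 crc = new CRC32();                  // JDK checksum primitive
        crc.update(block, 0, block.length);
        System.out.printf("crc32 = 0x%08x%n", crc.getValue());
    }
}
```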
[ placeholder for additional examples ]
■RDS pg_vector – ISA usage (AVX-512 with popcnt, simdsort, checksum), accelerator usage (QAT for ~30% faster backup time)
■OpenSearch – indexing and search perf improvement with AVX-512 (VBMI2), QAT (index and dataset compression acceleration)
■Example: crc32 improvements through AVX-512 and ISA-L – this gave the customer a 10X speedup in the time the CPU spent in the CRC portion of the app
■Java regex – improvements through reuse of precompiled patterns (see the sketch below)
■Large memory transfer operations >8K transparently offloaded to the DSA accelerator (DTO library)
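A minimal sketch of the Java regex item above: reusing one precompiled Pattern instead of recompiling the expression on every call, which is a common source of avoidable CPU time. The pattern and inputs here are illustrative placeholders, not from the talk.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexReuse {
    // Compile once; Pattern instances are immutable and safe to share across threads.
    private static final Pattern KEY_VALUE = Pattern.compile("(\\w+)=(\\w+)");

    // Slow path for comparison: String.matches() recompiles the regex on every call.
    static boolean slowMatch(String s) {
        return s.matches("(\\w+)=(\\w+)");
    }

    // Fast path: reuse the precompiled pattern.
    static boolean fastMatch(String s) {
        Matcher m = KEY_VALUE.matcher(s);
        return m.matches();
    }

    public static void main(String[] args) {
        System.out.println(fastMatch("region=us-east-1"));  // true
        System.out.println(slowMatch("not a pair"));        // false
    }
}
```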
•Default/unoptimized: baseline
•Optimized with ISA: AVX-512 ‘Crypto NI’
•Optimized with accelerator: Intel® QAT
OpenSSL 3.0.5 and later has support for Crypto-NI.
https://github.com/intel/QAT_Engine – pluggable OpenSSL library that can use either QAT or Crypto-NI
Evaluating applications for offload to CPU ISAs & GPUs
Specialized CPU ISAs vs. discrete GPUs and accelerators:
■Dependent operations: CPUs can handle dependent operations efficiently due to cache hierarchies; GPUs perform best with tasks having low inter-thread communication and low dependency across operations
■Memory access patterns: CPUs are flexible in handling irregular memory access patterns; GPUs require more regular, structured memory access patterns
■Data granularity: CPU ISAs are effective at smaller granularity (e.g. processing streams of data); GPUs are effective on larger datasets due to the overhead involved in dispatching work, setting up the execution, and transferring data to/from device memory
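A toy model of the data-granularity row above, with hypothetical numbers that are assumptions rather than measurements from the talk: offload pays off only once the compute saved on the device exceeds the fixed dispatch and transfer overhead.

```java
public class OffloadBreakEven {
    public static void main(String[] args) {
        // All figures are illustrative assumptions, not measurements.
        double cpuSecPerMB = 0.010;   // time to process 1 MB on the CPU
        double gpuSecPerMB = 0.001;   // time to process 1 MB on the GPU/accelerator
        double overheadSec = 0.002;   // fixed dispatch + data transfer cost per offload

        // Offload wins when: overhead + n*gpu < n*cpu  =>  n > overhead / (cpu - gpu)
        double breakEvenMB = overheadSec / (cpuSecPerMB - gpuSecPerMB);
        System.out.printf("Offload wins above ~%.2f MB per batch%n", breakEvenMB);
    }
}
```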
Offloading to CPU ISAs and GPUs:
Data services workload functions can be:
■Front-end intensive (query optimization, planning & distribution) – execution pattern resembles compilers, a good fit for the CPU
■Backend intensive (operator execution) – complex queries can benefit from GPU offload
■Storage intensive (efficient, reliable, distributed) – specialized CPU ISAs can optimize numerous functions (data movement, compression, encryption, erasure coding, etc.)
Some operators are an order of magnitude faster on GPU with NVIDIA RAPIDS cuDF
•AVX-512 based compression optimizations are available in Intel® ISA-L, DPDK, zlib (e.g. the chromium fork), etc.
[WIP] Methodology for breaking down apps to primitives…
1. Generate a flamegraph
2. Analyze the flamegraph and generate a breakdown
Event monitoring:
•Hardware events at the system or core level
•Core frequency, CPI (cycles/instruction), cache, TLB, memory, power, TMA (Top-down Microarchitecture Analysis)
•OS events
•CPU-clock based flamegraphs
•Interrupts, page faults, context switches, disk and network IO
•Challenges with GPUs & accelerators
•Harder to understand effective utilization
•Symbolization is a hard problem, but is getting addressed now
•Vendor/product differences in metrics (e.g. ‘effective FLOPS’)
Methodology: continuous profiling rather than one-off, with the ability to switch to fine-grained collection when required
•Tools: gProfiler, PerfSpect, etc.
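A hedged sketch of step 2 above, not the authors' tooling: given profiler output in the common "collapsed stacks" text format (one line per stack, semicolon-separated frames followed by a sample count, as produced by flamegraph tooling), aggregate samples into a rough primitive breakdown. The keyword-to-primitive map and the sample lines are illustrative assumptions.

```java
import java.util.*;

public class PrimitiveBreakdown {
    // Illustrative mapping from frame-name keywords to primitive categories.
    private static final Map<String, String> KEYWORDS = Map.of(
            "crc", "checksum",
            "memcpy", "memory movement",
            "deflate", "compression",
            "regex", "regex/parsing",
            "GC", "garbage collection");

    public static void main(String[] args) {
        // Toy collapsed-stack lines: "frame;frame;frame <samples>"
        List<String> lines = List.of(
                "main;handle_put;crc32_compute 420",
                "main;handle_put;memcpy 310",
                "main;query;regex_match 120",
                "GC Thread;G1ParTask 90");

        Map<String, Long> samplesByPrimitive = new TreeMap<>();
        for (String line : lines) {
            int split = line.lastIndexOf(' ');
            String stack = line.substring(0, split);
            long samples = Long.parseLong(line.substring(split + 1));
            // Attribute the stack to the first matching primitive keyword, else "other".
            String primitive = KEYWORDS.entrySet().stream()
                    .filter(e -> stack.contains(e.getKey()))
                    .map(Map.Entry::getValue)
                    .findFirst().orElse("other");
            samplesByPrimitive.merge(primitive, samples, Long::sum);
        }
        samplesByPrimitive.forEach((p, s) -> System.out.println(p + ": " + s + " samples"));
    }
}
```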
Future Work…
■Auto-detect and dispatch primitives – at the kernel or VM level?
■Auto-resolve/handle dependencies between primitives and conditions
■Monitor CPU, memory and device queues to maintain load balancing
■Ideas include JIT-based inlining, library/optimized primitive injection, PGO, etc.