Cheating the Cloud: 50% Savings with Compression Dictionaries by Łukasz Paszkowski
ScyllaDB
Oct 16, 2024
About This Presentation
Faced with high networking costs, we tackled insufficient compression with a custom RPC compressor using ZSTD and external dictionary support. This dynamic approach slashes data transfer costs and optimizes performance. #Cloud #Compression #ScyllaDB
Slide Content
A ScyllaDB Community
Cheating the Cloud: 50% Savings
with Compression Dictionaries
Łukasz Paszkowski
Software Team Leader at ScyllaDB
Łukasz Paszkowski (He/Him)
■PhD in Mathematics
■Developed Pacertool, a guidance tool for pacemaker implantation in
heart failure patients
■Fan of outdoor activities
Problem Statement
■A customer reported high networking costs, especially for inter-zone replication
across their DCs
■The collected tcpdump samples looked uncompressed
■Insufficient compression ratio: compression saved only around 25%
Background
■At the time of the customer issue, when internode_compression was enabled
(compressed RPC between all nodes or just between different DCs), the LZ4
algorithm was used
■Since LZ4 is only a dictionary modeler, without an entropy coding step, the
compressed data often looks almost like plain text
■LZ4 was the default choice in Cassandra
●ZSTD is newer; it did not exist in Cassandra when LZ4 was chosen
●LZ4 gets an excellent compression ratio per CPU cycle spent
Dictionary Coding
An eye for an eye, a tooth for a tooth
“An” 000
“ eye” 001
“ for” 010
“ an” 011
“,” 100
“ a” 101
“ tooth” 110
000 001 010 011 001 100 101 110 010 101 110
■Dictionary Creation
■Data Transformation
■Storage of Dictionary and Encoded Data
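The three steps above can be sketched in a few lines of Python. This is a toy illustration, not code from the talk: the tokenizer (`tokenize`) and the rule of assigning codes in order of first appearance are assumptions chosen so that the output reproduces the code table shown on the slide.

```python
import re

def tokenize(text):
    # Assumed tokenization: each word carries its leading space, punctuation
    # stands alone. Only lossless for text shaped like the slide example.
    return re.findall(r" ?\w+|[^\w\s]", text)

def build_dictionary(tokens):
    # Dictionary creation: assign the next free code to each new token.
    codes = {}
    for tok in tokens:
        if tok not in codes:
            codes[tok] = len(codes)
    return codes

def encode(text):
    # Data transformation: replace every token with its fixed-width code.
    tokens = tokenize(text)
    codes = build_dictionary(tokens)
    width = max(1, (len(codes) - 1).bit_length())
    bits = " ".join(format(codes[t], f"0{width}b") for t in tokens)
    return codes, bits

def decode(codes, bits):
    # The decoder needs the stored dictionary as well as the encoded data.
    by_code = {code: tok for tok, code in codes.items()}
    return "".join(by_code[int(chunk, 2)] for chunk in bits.split())

text = "An eye for an eye, a tooth for a tooth"
codes, bits = encode(text)
print(bits)  # 000 001 010 011 001 100 101 110 010 101 110
```

Seven distinct tokens fit in 3-bit codes, so the 11 tokens take 33 bits, plus the cost of storing the dictionary itself.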
Lempel-Ziv
“monkey see monkey do”
“monkey see <11,7>do”
h(“monk”) 0
h(“nkey”) 2
h(“ see”) 7
h(“onke”) 1
■A special case of dictionary coding, where already-processed parts of the input are used as the dictionary
■Simple and fast, with a decent compression ratio
■LZ77 and LZ78 are the most common variants
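A minimal LZ77-style sketch of the "monkey see monkey do" example follows. It assumes a hash table keyed on 4-character sequences (a plain Python dict stands in for the hash function `h`) and emits either literal strings or `(offset, length)` back references. It is an illustration of the idea only, not the LZ4 or ZSTD format.

```python
def lz_compress(data, min_match=4):
    table = {}      # 4-character sequence -> most recent position seen
    tokens = []     # output: literal strings and (offset, length) pairs
    literal = []
    i, n = 0, len(data)
    while i < n:
        key = data[i:i + min_match]
        cand = table.get(key) if len(key) == min_match else None
        if cand is not None:
            # Extend the match as far as the two regions agree.
            length = min_match
            while i + length < n and data[cand + length] == data[i + length]:
                length += 1
            if literal:
                tokens.append("".join(literal))
                literal = []
            tokens.append((i - cand, length))
            # Index the positions covered by the match for later lookups.
            for j in range(i, i + length):
                if j + min_match <= n:
                    table[data[j:j + min_match]] = j
            i += length
        else:
            if len(key) == min_match:
                table[key] = i
            literal.append(data[i])
            i += 1
    if literal:
        tokens.append("".join(literal))
    return tokens

def lz_decompress(tokens):
    out = []
    for tok in tokens:
        if isinstance(tok, str):
            out.extend(tok)
        else:
            offset, length = tok
            start = len(out) - offset
            for k in range(length):   # char by char, so overlaps work
                out.append(out[start + k])
    return "".join(out)

print(lz_compress("monkey see monkey do"))
# ['monkey see ', (11, 7), 'do']
```

The literals pass through unchanged, which is why the output of a pure Lempel-Ziv coder can still look close to plain text.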
LZ4
■Belongs to the class of the fastest compressors
■Applies Lempel-Ziv compression to deduplicate strings by replacing repeated
strings with back references
■Possible to load an external lookup dictionary
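To show what an external dictionary buys on short messages, here is a small sketch using Python's standard `zlib` module and its `zdict` parameter. Note the swap: zlib implements DEFLATE, not LZ4, and is used here only because it ships with Python and supports preset dictionaries. The RPC-like message and the dictionary contents are made up for illustration.

```python
import zlib

def compress_with_dict(message: bytes, dictionary: bytes) -> bytes:
    # The compressor may emit back references into the preset dictionary.
    c = zlib.compressobj(level=6, zdict=dictionary)
    return c.compress(message) + c.flush()

def decompress_with_dict(blob: bytes, dictionary: bytes) -> bytes:
    # The receiver must hold exactly the same dictionary.
    d = zlib.decompressobj(zdict=dictionary)
    return d.decompress(blob) + d.flush()

# Hypothetical message and dictionary, not taken from real traffic.
dictionary = b'{"op":"write","table":"users","key":"'
message = b'{"op":"write","table":"users","key":"alice"}'

plain = zlib.compress(message, 6)
with_dict = compress_with_dict(message, dictionary)
print(len(message), len(plain), len(with_dict))
```

A short message has little internal repetition, so compressing it alone gains almost nothing; with a shared dictionary most of it becomes a single back reference.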
ZSTD
■Applies Lempel-Ziv compression with a large search window to deduplicate
strings by replacing repeated strings with back references
■Applies entropy coding
●Huffman coding (used for entries in the Literals section)
●Finite State Entropy (used for the high bits of match descriptions)
■Possible to load an external lookup dictionary
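The entropy coding step is what LZ4 lacks. As a rough illustration, the sketch below builds textbook Huffman code lengths for a byte string and counts the resulting bits. It is a generic construction with an assumed tie-breaking rule; ZSTD's actual Huffman tables and its Finite State Entropy coder are more involved and are not reproduced here.

```python
import heapq
from collections import Counter

def huffman_code_lengths(data: bytes) -> dict:
    """Return {symbol: code length in bits} for a textbook Huffman code."""
    freq = Counter(data)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # Heap entries: (weight, tie breaker, symbols in this subtree).
    heap = [(w, sym, [sym]) for sym, w in freq.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    while len(heap) > 1:
        w1, t1, s1 = heapq.heappop(heap)
        w2, t2, s2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        for sym in s1 + s2:
            lengths[sym] += 1
        heapq.heappush(heap, (w1 + w2, min(t1, t2), s1 + s2))
    return lengths

def encoded_bits(data: bytes) -> int:
    lengths = huffman_code_lengths(data)
    return sum(lengths[b] for b in data)

literals = b"aaaaaaaabbbbccd"
print(huffman_code_lengths(literals))
print(encoded_bits(literals), "bits instead of", 8 * len(literals))
```

Frequent symbols get short codes, so literals that survive the Lempel-Ziv stage no longer cost a full 8 bits each.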
Back to the problem
Solution
Custom RPC compressor with external dictionary support, plus a mechanism that
trains new dictionaries on RPC traffic, distributes them across the cluster, and performs
a live switch of connections to the new dictionaries.
■Sample: continuously sample RPC traffic for some time
■Train: train a 100 KiB dictionary on a 16 MiB sample
■Distribute: distribute the new dictionary via a system_distributed table
■Switch: negotiate the switch separately within each connection
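The slides do not say how the Sample step is implemented. One common way to keep a bounded, uniform sample of a continuous stream is reservoir sampling, sketched below. The class name `TrafficSampler`, its message-count capacity, and the seed are all hypothetical; a real sampler would more likely be bounded by bytes (for example the 16 MiB budget above) than by message count.

```python
import random

class TrafficSampler:
    """Uniform random sample of at most `capacity` messages (Algorithm R)."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.seen = 0          # total messages offered so far
        self.sample = []       # current reservoir
        self._rng = random.Random(seed)

    def offer(self, message: bytes):
        self.seen += 1
        if len(self.sample) < self.capacity:
            # Reservoir not full yet: keep everything.
            self.sample.append(message)
        else:
            # Replace a random slot with probability capacity / seen.
            j = self._rng.randrange(self.seen)
            if j < self.capacity:
                self.sample[j] = message

# Hypothetical traffic: 1000 small messages, keep 10 of them.
sampler = TrafficSampler(capacity=10, seed=1)
traffic = [f"msg-{i}".encode() for i in range(1000)]
for m in traffic:
    sampler.offer(m)
print(sampler.seen, len(sampler.sample))
```

The reservoir can then be handed to a dictionary trainer once the sampling period ends, while memory use stays fixed regardless of how much traffic passes through.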