Balancing Compaction Principles and Practices

ScyllaDB, Jun 24, 2024

About This Presentation

Compaction is a crucial component for preventing storage consumption from exploding. In this session, we’ll talk about why compaction is required and its principles of operation, the main compaction strategies available for use, when they should be used, and how they can be configured. Finally, we...


Slide Content

The Right Compaction Strategy Can Boost Your Performance. Raphael Carvalho, Software Engineer at ScyllaDB

Raphael Carvalho: Syslinux (bootloader), OSv (unikernel), Seastar (ScyllaDB's heart), ScyllaDB (the best db in the world)

Presentation Agenda: LSM tree; What is Compaction? Why is it Needed?; Read, Write and Space Amplification; Different Compaction Strategies; When to Use Each One; SAG in ICS; Time Series Compaction Strategy (TWCS)

What is LSM-tree Compaction? The LSM storage engine's write path (shown as an animated build across several slides): writes go into an in-memory memtable and are also appended to a commit log for durability. When the memtable fills up, it is flushed to disk as an immutable, sorted SSTable. As writes continue, SSTables accumulate on disk; compaction then merges several SSTables back into a single one, after which the inputs are removed.

What is Compaction? (cont.) This technique of keeping sorted files and merging them is well known and often called a Log-Structured Merge (LSM) Tree. It was published in 1996; the earliest popular application that I know of is the Lucene search engine (1999). It offers high write performance, immediately readable data, and reasonable read performance.

Compaction Strategy (a.k.a. file picking policy). Which files to compact, and when? This is called the Compaction Strategy. The goal of the strategy is low amplification:
- Read Amplification: avoid read requests needing many SSTables
- Space Amplification: avoid overwritten/deleted/expired data staying on disk, and avoid excessive temporary disk space needs (scary!)
- Write Amplification: avoid compacting the same data again and again
Which compaction strategy shall I choose?
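
To make the three factors concrete, they can be written as simple ratios (a hedged summary of the standard definitions; these formulas are not from the slides themselves):

    WA = (bytes written by flushes + bytes rewritten by compactions) / (bytes of user data written)
    RA = number of SSTables a single read request may have to probe
    SA = (bytes on disk: live + obsolete + temporary) / (bytes of live logical data)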

Read, Write and Space Amplification: make a choice! This kind of trade-off is well known in distributed databases (as with CAP). The RUM Conjecture states: we cannot design an access method for a storage system that is optimal in all three of the following aspects: Reads, Updates, and Memory. It is impossible to decrease Read, Write, and Space Amplification all at once. A strategy can, for example, optimize for Write while sacrificing Read and Space, whereas another can optimize for Space and Read while sacrificing Writes.

Read, Write and Space Amplification: make a choice! (Diagram: the READ/WRITE/SPACE trade-off triangle, with Tiered (STCS, ICS) and Leveled occupying different regions.)

Compaction Strategies History. Cassandra and ScyllaDB start with the Size-Tiered Compaction Strategy: efficient write performance, but inconsistent read performance and substantial waste of disk space, i.e. bad space amplification (due to slow garbage collection). To fix the read/space issues of tiered compaction, Leveled Compaction was introduced. It fixes the read and space issues BUT introduces a new problem: write amplification.

Strategy #1: Size-Tiered Compaction. Cassandra's oldest, and still the default, compaction strategy. It dates back to Google's BigTable paper (2006); the idea was used even earlier (e.g., Lucene, 1999).

Size-Tiered Compaction Strategy (shown as an animated build across several slides): the memtable is flushed into small SSTables, which accumulate in the lowest tier. When N similar-sized files exist in a tier, they are compacted together, with the result placed into the next tier. The process repeats tier by tier, so ever-larger SSTables build up in the higher tiers.

Size-Tiered Compaction - Amplification. Write amplification: O(logN), where N is (data size) / (flushed SSTable size). Most data is in the highest tier, so it needed to pass through O(logN) tiers. This is asymptotically optimal.
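
A quick sketch of why the O(logN) bound holds (my own back-of-the-envelope reasoning, assuming a fixed fan-in of T similar-sized files per tier, e.g. T = 4, the usual min_threshold):

    number of tiers  ≈ log_T(N)                    (each tier holds data ~T times larger than the one below)
    bytes rewritten  ≈ user_bytes × log_T(N)       (each byte is rewritten once per tier it climbs)
    => write amplification = O(logN)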

Size-Tiered Compaction - Amplification. What is the read amplification? There are O(logN) SSTables, but: if the workload writes a partition once and never modifies it, eventually each partition's data will be compacted into one SSTable, and the in-memory bloom filter will usually allow reading only one SSTable. Optimal. But if the workload keeps updating a partition, all SSTables will contain updates to the same partition: O(logN) reads per read request. Reasonable, but not great.

Size-Tiered Compaction - Amplification. Space amplification (illustrated with a chart on the original slide; see Examples 1 and 2 below for measured numbers).

Strategy #2: Leveled Compaction. Introduced in Cassandra 1.0, in 2011; based on Google's LevelDB (itself based on Google's BigTable). It no longer has size-tiered's huge SSTables. Instead it has runs: a run is a collection of small (160 MB by default) SSTables with non-overlapping key ranges. A huge SSTable must be rewritten as a whole, but in a run we can modify only parts of it (individual SSTables) while keeping the disjoint-key requirement.
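
For reference, this is roughly how a table is switched to Leveled Compaction in CQL (a minimal sketch; foo.bar is a placeholder table, and sstable_size_in_mb is the standard option behind the 160 MB default mentioned above):

    ALTER TABLE foo.bar WITH compaction = {
        'class' : 'LeveledCompactionStrategy',   -- pick LCS as the strategy
        'sstable_size_in_mb' : '160'             -- target size of each SSTable within a run
    };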

Leveled Compaction Strategy (shown as an animated build across several slides): memtable flushes land in Level 0; Level 1 is a run of 10 SSTables and Level 2 a run of 100, each run keeping disjoint key ranges. Level 0 is compacted into all SSTables from Level 1, due to key-range overlapping. The output is placed into Level 1, which may then have exceeded its capacity, in which case Level 1 must be compacted into Level 2: one exceeding SSTable is picked from Level 1 and compacted with the overlapping SSTables in Level 2 (about ~10 of them, due to the default fan-out of 10). The input is removed from Level 1 and the output placed into Level 2, without breaking key disjointness in Level 2.

Leveled Compaction - Amplification. Space amplification: because of the per-level SSTable counts, 90% of the data is in the deepest level (if it is full!). These SSTables do not overlap, so that level cannot hold duplicate data; at most ~10% of the space is wasted. Also, each compaction needs only a constant (~12 × 160 MB) amount of temporary space. Nearly optimal.
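
The 90% figure follows directly from the level sizing (a sketch assuming the default fan-out of 10 and a full deepest level k):

    size(level i)         = 10^i × 160 MB
    sum of levels 0..k-1  ≈ size(level k) / 9
    => the deepest level holds ≈ 90% of all data, and duplicates can only live in the other ≈ 10%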

Leveled Compaction - Amplification. Read amplification: we have O(N) SSTables! But within each level the SSTables have disjoint ranges (cached in memory), so in the worst case only O(logN) SSTables are relevant to a partition, plus the size of L0. Under some assumptions (updates of complete rows, of similar sizes), the space-amplification argument implies that 90% of reads will need just one SSTable. Nearly optimal.
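
In other words (a hedged restatement, not from the slides): because each level is a run with disjoint key ranges, a lookup touches at most one SSTable per level, so

    SSTables probed per read ≤ (#SSTables in L0) + (#levels) ≈ |L0| + O(logN)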

Leveled Compaction - Amplification. Write amplification: high (chart on the original slide). Promoting one SSTable from a level rewrites the ~10 overlapping SSTables in the next level, so each byte may be rewritten around fan-out times per level on its way down.

Example 1 - Write-Heavy Workload. Size-tiered compaction: at some points it needs twice the disk space. In ScyllaDB, with many shards, the maximum space use is "usually" not concurrent across shards. Leveled compaction: more than double the amount of disk I/O. The test used smaller-than-default SSTables (10 MB) to illustrate the problem; the same problem occurs with the default SSTable size (160 MB) with larger workloads.

Example 1 (Space Amplification): chart showing leveled compaction holding to a constant multiple of the flushed memtable & SSTable size, versus x2 space amplification for size-tiered.

Example 2 - Overwrite Workload. Write the same 4 million partitions 15 times:

    cassandra-stress write n=4000000 -pop seq=1..4000000 -schema "replication(strategy=org.apache.cassandra.locator.SimpleStrategy,factor=1)"

In this test cassandra-stress is not rate limited. Again, small (10 MB) LCS SSTables. Necessary amount of SSTable data: 1.2 GB. STCS space amplification: x7.7! LCS space amplification is lower, a constant multiple of the SSTable size. Incremental will be around x2 (if it decides to compact fewer files).

Example 2 (Space Amplification): chart showing the x7.7 space amplification for STCS.

Strategy #3: Incremental Compaction. Size-tiered compaction needs temporary space because we only remove a huge SSTable after we have fully compacted it. Let's split each huge SSTable into a run (à la LCS) of "fragments": treat the entire run (not individual SSTables) as one file for STCS purposes, and remove individual SSTables as they are compacted. Low temporary space.

Incremental Compaction - Amplification. Space amplification: small constant temporary space needs, even smaller than LCS (M × S per parallel compaction, e.g., M = 4, S = 160 MB). Overwrite-mostly workloads are still a worst case, but 2-fold instead of 5-fold. Optimal. Write amplification: O(logN) with a small constant, same as size-tiered compaction. Read amplification: like size-tiered, at worst O(logN) if updating the same partitions.

Example 1 - Size-Tiered vs Incremental: chart on the original slide repeating Example 1 with incremental compaction, which avoids the x2 peak space use.

Is it Enough? The space overhead problem was efficiently fixed in Incremental (ICS), however... Incremental (ICS) and size-tiered (STCS) strategies share the same space amplification (~2-4x) when facing overwrite-intensive workloads, where they cover a similar region in the three-dimensional efficiency space (RUM trade-offs). (Diagram: STCS and ICS shown covering a similar region of the READ/WRITE/SPACE triangle.)

Space Amplification Goal (SAG). Leveled and Size-Tiered (or ICS) cover different regions, and interesting regions cannot be reached with either strategy alone. But interesting regions can be reached by combining the data layout of both strategies, i.e. a Hybrid (Tiered+Leveled) approach. (Diagram: the READ/WRITE/SPACE triangle with STCS, ICS and LCS marked.)

Further on ICS + SAG. The Space Amplification Goal (SAG) is a property that controls the size ratio of the largest and second-largest tier. It is a value between 1 and 2 (defined in the table's schema); a value of 1.5 implies cross-tier compaction when the second-largest tier is half the size of the largest. Effectively, it helps control space amplification. It is not an upper bound, but results show that compaction will work towards reducing the actual space amplification to below the configured value. The lower the SAG value, the lower the space amplification but the higher the write amplification. A good initial value is 1.5; then decrease conservatively.

Schema for ICS' SAG:

    ALTER TABLE foo.bar WITH compaction = {
        'class' : 'IncrementalCompactionStrategy',
        'space_amplification_goal' : '1.5'
    };

ICS with Space Amplification Goal (SAG): results chart on the original slide.

ICS has more efficient GC. Accumulation of tombstone records is a known problem in LSM trees: it makes queries slower (read amplification) and costs more CPU work (to preserve the latest version). ICS employs a SAG-like mechanism, but focused on expired data rather than space. It is enabled by default and can be controlled with the usual parameters: tombstone_compaction_interval (defaults to 864000 seconds, i.e. 10 days) and tombstone_threshold (defaults to 0.2).
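
Both knobs are ordinary compaction sub-options, so they can be set next to the strategy (a sketch reusing the placeholder table from the SAG example; the values shown simply restate the defaults):

    ALTER TABLE foo.bar WITH compaction = {
        'class' : 'IncrementalCompactionStrategy',
        'tombstone_compaction_interval' : '864000',  -- seconds, i.e. 10 days
        'tombstone_threshold' : '0.2'                -- consider an SSTable once ~20% of its data is expired
    };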

Strategy #4: Time Window (TWCS). Designed for time-series workloads; groups data of similar age together. This helps with: garbage-collecting expired data, since data of similar age will expire at roughly the same time; and read performance, since queries using a time range will find their data in a small number of files. Common anti-patterns: not having every cell TTL'd (the recommendation is to use default_time_to_live); deletions and overwrites (not well supported; a major compaction is usually needed afterwards). Keep the number of windows to a small constant. Recommendation: 20. A configuration sketch follows below.
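
Putting the recommendations together (a sketch, not from the slides: metrics.samples is a hypothetical table that keeps ~20 windows by pairing a 20-day default TTL with 1-day compaction windows):

    CREATE TABLE metrics.samples (
        sensor_id uuid,
        ts        timestamp,
        value     double,
        PRIMARY KEY (sensor_id, ts)
    ) WITH default_time_to_live = 1728000      -- 20 days, so every cell is TTL'd
      AND compaction = {
        'class' : 'TimeWindowCompactionStrategy',
        'compaction_window_unit' : 'DAYS',     -- window granularity
        'compaction_window_size' : '1'         -- one window per day => ~20 live windows
    };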

Compaction Strategies Summary
Workload: Size-Tiered | Leveled | Incremental | Time-Window
Write-only: 2x peak space | 2x writes | Best | -
Overwrite: huge peak space | write amplification | SAG helps | -
Read-mostly, few updates: read amplification | Best | read amplification | -
Read-mostly, but a lot of updates: read and space amplification | write amplification may overwhelm | read amplification, again SAG helps | -
Time series: write, read, and space amplification | write and space amplification | write and read amplification | Best

Stay in Touch Raphael S. Carvalho [email protected] @raphael_scarv @raphaelsc https://www.linkedin.com/in/raphaelscarvalho/